Abstract

We study the problem of learning a sparse linear regression vector under additional
conditions on the structure of its sparsity pattern. This problem is relevant in machine
learning, statistics and signal processing. It is well known that a linear regression can
benefit from knowledge that the underlying regression vector is sparse. The combinatorial
problem of selecting the nonzero components of this vector can be "relaxed" by regularizing
the squared error with a convex penalty function like the $\ell_1$ norm. However, in many
applications, additional conditions on the structure of the regression vector and its
sparsity pattern are available. Incorporating this information into the learning method may
lead to a significant decrease of the estimation error.

In this paper, we present a family of convex penalty functions, which encode prior 
knowledge on the structure of the vector formed by the absolute values of the regression 
coefficients. This family subsumes the $\ell_1$ norm and is flexible enough to include different
models of sparsity patterns, which are of practical and theoretical importance. We establish 
the basic properties of these penalty functions and discuss some examples where they can be 
computed explicitly. Moreover, we present a convergent optimization algorithm for solving 
regularized least squares with these penalty functions. Numerical simulations highlight the 
benefit of structured sparsity and the advantage offered by our approach over the Lasso 
method and other related methods. 



1 Introduction 



The problem of sparse estimation is becoming increasingly important in statistics, machine
learning and signal processing. In its simplest form, this problem consists of estimating a
regression vector $\beta^* \in \mathbb{R}^n$ from a set of linear measurements
$y \in \mathbb{R}^m$, obtained from the model

$$y = X\beta^* + \xi \tag{1.1}$$

where $X$ is an $m \times n$ matrix, which may be fixed or randomly chosen, and
$\xi \in \mathbb{R}^m$ is a vector which results from the presence of noise.

An important rationale for sparse estimation comes from the observation that in many
practical applications the number of parameters $n$ is much larger than the data size $m$,
but the vector $\beta^*$ is known to be sparse, that is, most of its components are equal to
zero. Under this sparsity assumption and certain conditions on the data matrix $X$, it has
been shown that regularization with the $\ell_1$ norm, commonly referred to as the Lasso
method [27], provides an effective means to estimate the underlying regression vector, see
for example [5, 7, 18, 28] and references therein. Moreover, this method can reliably select
the sparsity pattern of $\beta^*$ [18], hence providing a valuable tool for feature selection.

In this paper, we are interested in sparse estimation under additional conditions on the
sparsity pattern of the vector $\beta^*$. In other words, not only do we expect this vector to
be sparse but also that it is structured sparse, namely certain configurations of its nonzero
components are to be preferred to others. This problem arises in several applications,
ranging from functional magnetic resonance imaging [9, 29], to scene recognition in vision
[10], to multi-task learning [QllliLllS] and to bioinformatics [26], see [14] for a discussion.

The prior knowledge that we consider in this paper is that the vector $|\beta^*|$, whose
components are the absolute values of the corresponding components of $\beta^*$, should belong
to some prescribed convex subset $\Lambda$ of the positive orthant. For certain choices of
$\Lambda$ this implies a constraint on the sparsity pattern as well. For example, the set
$\Lambda$ may include vectors with some desired monotonicity constraints, or other constraints
on the "shape" of the regression vector. Unfortunately, the constraint that
$|\beta^*| \in \Lambda$ is nonconvex and its implementation is computationally challenging. To
overcome this difficulty, we propose a family of penalty functions, which are based on an
extension of the $\ell_1$ norm used by the Lasso method and involve the solution of a smooth
convex optimization problem. These penalty functions favor regression vectors $\beta$ such
that $|\beta| \in \Lambda$, thereby incorporating the structured sparsity constraints.

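Precisely, we propose to estimate $\beta^*$ as a solution of the convex optimization problem

$$\min\left\{\|X\beta - y\|_2^2 + 2\rho\,\Omega(\beta|\Lambda) : \beta \in \mathbb{R}^n\right\} \tag{1.2}$$

where $\|\cdot\|_2$ denotes the Euclidean norm, $\rho$ is a positive parameter and the penalty
function takes the form

$$\Omega(\beta|\Lambda) = \inf\left\{\frac{1}{2}\sum_{i=1}^{n}\left(\frac{\beta_i^2}{\lambda_i} + \lambda_i\right) : \lambda \in \Lambda\right\}.$$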

As we shall see, a key property of the penalty function is that it exceeds the $\ell_1$ norm
of $\beta$ when $|\beta| \notin \Lambda$, and it coincides with the $\ell_1$ norm otherwise.
This observation suggests a heuristic interpretation of the method (1.2): among all vectors
$\beta$ which have a fixed value of the $\ell_1$ norm, the penalty function will encourage
those for which $|\beta| \in \Lambda$. Moreover, when $|\beta| \in \Lambda$ the function
$\Omega$ reduces to the $\ell_1$ norm and, so, the solution of problem (1.2) is expected to be
sparse. The penalty function therefore will encourage certain desired sparsity patterns.
Indeed, the sparsity pattern of $\beta$ is contained in that of the auxiliary vector $\lambda$
at the optimum and, so, if the set $\Lambda$ allows only for certain sparsity patterns of
$\lambda$, the same property will be "transferred" to the regression vector $\beta$.

There has been some recent research interest in structured sparsity, see
[1, 13, 14, 19, 22, 30, 31] and references therein. Closest to our approach are penalty
methods built around the idea of mixed norms. In particular, the group Lasso method [31]
assumes that the
components of the underlying regression vector $\beta^*$ can be partitioned into prescribed
groups, such that the restriction of $\beta^*$ to a group is equal to zero for most of the
groups. This idea has been extended in [14, 32] by considering the possibility that the groups
overlap according to certain hierarchical or spatially related structures. Although these
methods have proved valuable in applications, they have the limitation that they can only
handle more restrictive classes of sparsity, for example patterns forming only a single
connected region. Our point of view is different from theirs and provides a means of
designing more flexible penalty functions which maintain convexity while modeling richer
model structures. For example, we will demonstrate that our family of penalty functions can
model sparsity patterns forming multiple connected regions of coefficients.

The paper is organized in the following manner. In Section 2 we establish some important
properties of the penalty function. In Section 3 we address the case in which the set
$\Lambda$ is a box. In Section 4 we derive the form of the penalty function corresponding to
the wedge with decreasing coordinates and in Section 5 we extend this analysis to the case in
which the constraint set $\Lambda$ is constructed from a directed graph. In Section 6 we
discuss useful duality relations and in Section 7 we address the issue of solving the problem
(1.2) numerically by means of an alternating minimization algorithm. Finally, in Section 8 we
provide numerical simulations with this method, showing the advantage offered by our approach.

A preliminary version of this paper appeared in the proceedings of the Twenty-Fourth Annual
Conference on Neural Information Processing Systems (NIPS 2010) [21]. The new version
contains Propositions 2.1, 2.3 and 2.4, the description of the graph penalty in Section 5,
Section 6, a complete proof of Theorem 7.1 and an experimental comparison with the method of


2 Penalty function 

In this section, we provide some general comments on the penalty function which we study in 
this paper. 

We first review our notation. We denote by $\mathbb{R}_+$ and $\mathbb{R}_{++}$ the
nonnegative and positive real line, respectively. For every $\beta \in \mathbb{R}^n$ we define
$|\beta| \in \mathbb{R}^n_+$ to be the vector formed by the absolute values of the components
of $\beta$, that is, $|\beta| = (|\beta_i| : i \in \mathbb{N}_n)$, where $\mathbb{N}_n$ is the
set of positive integers up to and including $n$. Finally, we define the $\ell_1$ norm of the
vector $\beta$ as $\|\beta\|_1 = \sum_{i \in \mathbb{N}_n} |\beta_i|$ and the $\ell_2$ norm as
$\|\beta\|_2 = \sqrt{\sum_{i \in \mathbb{N}_n} \beta_i^2}$.

Given an $m \times n$ input data matrix $X$ and an output vector $y \in \mathbb{R}^m$,
obtained from the linear regression model $y = X\beta^* + \xi$ discussed earlier, we consider
the convex optimization problem

$$\inf\left\{\|X\beta - y\|_2^2 + 2\rho\,\Gamma(\beta, \lambda) : \beta \in \mathbb{R}^n, \lambda \in \Lambda\right\} \tag{2.1}$$

where $\rho$ is a positive parameter, $\Lambda$ is a prescribed convex subset of the positive
orthant $\mathbb{R}^n_{++}$ and the function
$\Gamma : \mathbb{R}^n \times \mathbb{R}^n_{++} \to \mathbb{R}$ is given by the formula

$$\Gamma(\beta, \lambda) = \frac{1}{2}\sum_{i \in \mathbb{N}_n}\left(\frac{\beta_i^2}{\lambda_i} + \lambda_i\right).$$

Note that in (2.1), for a fixed $\beta \in \mathbb{R}^n$, the infimum over
$\lambda = (\lambda_i : i \in \mathbb{N}_n)$ in general is not attained; however, for a fixed
$\lambda \in \Lambda$, the infimum over $\beta$ is always attained.

Since the auxiliary vector $\lambda$ appears only in the second term of the objective function
of problem (2.1), and our goal is to estimate $\beta^*$, we may also directly consider the
regularization problem

$$\min\left\{\|X\beta - y\|_2^2 + 2\rho\,\Omega(\beta|\Lambda) : \beta \in \mathbb{R}^n\right\}, \tag{2.2}$$

where the penalty function takes the form

$$\Omega(\beta|\Lambda) = \inf\left\{\Gamma(\beta, \lambda) : \lambda \in \Lambda\right\}. \tag{2.3}$$

Note that $\Gamma$ is convex on its domain because each of its summands is a convex function.
Hence, when the set $\Lambda$ is convex it follows that $\Omega(\cdot|\Lambda)$ is a convex
function and (2.2) is a convex optimization problem.

An essential idea behind our construction of the penalty function is that, for every
$\lambda \in \mathbb{R}_{++}$, the quadratic function $\Gamma(\cdot, \lambda)$ provides a smooth
approximation to $|\beta|$ from above, which is exact at $\beta = \pm\lambda$. We indicate this
graphically in Figure 1-a. This fact follows immediately by the arithmetic-geometric mean
inequality, which states, for every $a, b \geq 0$, that $(a + b)/2 \geq \sqrt{ab}$.

A special case of the formulation (2.2) with $\Lambda = \mathbb{R}^n_{++}$ is the Lasso method
[27], which is defined to be a solution of the optimization problem

$$\min\left\{\|y - X\beta\|_2^2 + 2\rho\,\|\beta\|_1 : \beta \in \mathbb{R}^n\right\}.$$

Indeed, using again the arithmetic-geometric mean inequality it follows that
$\Omega(\beta|\mathbb{R}^n_{++}) = \|\beta\|_1$. Moreover, if $\beta_i \neq 0$ for every
$i \in \mathbb{N}_n$, then the infimum is attained for $\lambda = |\beta|$. This important
special case motivated us to consider the general method described above. The utility of (2.3)
is that upon inserting it into (2.2) there results an optimization problem over $\lambda$ and
$\beta$ with a continuously differentiable objective function. Hence, we have succeeded in
expressing a nondifferentiable convex objective function by one which is continuously
differentiable on its domain.
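
The majorization property behind this special case can be checked numerically. The following is a minimal sketch, not part of the original method description: the helper name `gamma` and the test vector are our own illustrative choices. It evaluates $\Gamma(\beta, \lambda)$ at random positive $\lambda$ and at $\lambda = |\beta|$.

```python
import numpy as np

def gamma(beta, lam):
    """Gamma(beta, lam) = 0.5 * sum(beta_i**2 / lam_i + lam_i), for lam > 0."""
    beta = np.asarray(beta, dtype=float)
    lam = np.asarray(lam, dtype=float)
    return 0.5 * np.sum(beta ** 2 / lam + lam)

beta = np.array([1.5, -0.3, 2.0])   # hypothetical vector with nonzero entries
l1 = np.sum(np.abs(beta))

# Every positive lam gives an upper bound on the l1 norm (AM-GM inequality).
rng = np.random.default_rng(0)
bounds = [gamma(beta, rng.uniform(0.1, 3.0, size=beta.size)) for _ in range(1000)]

# The bound is tight at lam = |beta|.
gap_at_optimum = gamma(beta, np.abs(beta)) - l1
```

Under these assumptions, every entry of `bounds` is at least `l1`, while `gap_at_optimum` vanishes up to rounding error.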

Our first observation concerns the differentiability of $\Omega$. In this regard, we provide a
sufficient condition which ensures this property of $\Omega$, which, although seemingly
cumbersome, covers important special cases. To present our result, for any real numbers
$a < b$, we define the parallelepiped
$[a, b]^n = \{x : x = (x_i : i \in \mathbb{N}_n),\ a \leq x_i \leq b,\ i \in \mathbb{N}_n\}$.

Definition 2.1. We say that the set $\Lambda$ is admissible if it is convex and, for all
$a, b \in \mathbb{R}$ with $0 < a < b$, the set $\Lambda_{a,b} := [a, b]^n \cap \Lambda$ is a
nonempty, compact subset of the interior of $\Lambda$.

Proposition 2.1. If $\beta \in (\mathbb{R}\setminus\{0\})^n$ and $\Lambda$ is an admissible
subset of $\mathbb{R}^n_{++}$, then the infimum above is uniquely achieved at a point
$\lambda(\beta) \in \Lambda$ and the mapping $\beta \mapsto \lambda(\beta)$ is continuous.
Moreover, the function $\Omega(\cdot|\Lambda)$ is continuously differentiable and its partial
derivatives are given, for any $i \in \mathbb{N}_n$, by the formula

$$\frac{\partial \Omega(\beta|\Lambda)}{\partial \beta_i} = \frac{\beta_i}{\lambda_i(\beta)}. \tag{2.4}$$

We postpone the proof of this proposition to the appendix. We note that, since
$\Omega(\cdot|\Lambda)$ is continuous, we may compute it at a vector $\beta$, some of whose
components are zero, as a limiting process. Moreover, at such a vector the function
$\Omega(\cdot|\Lambda)$ is in general not differentiable; for example, consider the case
$\Omega(\beta|\mathbb{R}^n_{++}) = \|\beta\|_1$.

The next proposition provides a justification of the penalty function as a means to
incorporate structured sparsity and establishes circumstances for which the penalty function
is a norm. To state our result, we denote by $\bar{\Lambda}$ the closure of the set $\Lambda$.

Proposition 2.2. For every $\beta \in \mathbb{R}^n$, we have that
$\|\beta\|_1 \leq \Omega(\beta|\Lambda)$ and the equality holds if and only if
$|\beta| := (|\beta_i| : i \in \mathbb{N}_n) \in \bar{\Lambda}$. Moreover, if $\Lambda$ is a
nonempty convex cone then the function $\Omega(\cdot|\Lambda)$ is a norm and we have that
$\Omega(\beta|\Lambda) \leq \omega\|\beta\|_1$, where
$\omega := \max\{\Omega(e_k|\Lambda) : k \in \mathbb{N}_n\}$ and $\{e_k : k \in \mathbb{N}_n\}$
is the canonical basis of $\mathbb{R}^n$.

Proof. By the arithmetic-geometric mean inequality we have that
$\|\beta\|_1 \leq \Gamma(\beta, \lambda)$, proving the first assertion. If
$|\beta| \in \bar{\Lambda}$, there exists a sequence $\{\lambda^k : k \in \mathbb{N}\}$ in
$\Lambda$, such that $\lim_{k\to\infty} \lambda^k = |\beta|$. Since
$\Omega(\beta|\Lambda) \leq \Gamma(\beta, \lambda^k)$ it readily follows that
$\Omega(\beta|\Lambda) \leq \|\beta\|_1$. Conversely, if
$\Omega(\beta|\Lambda) = \|\beta\|_1$, then there is a sequence
$\{\lambda^k : k \in \mathbb{N}\}$ in $\Lambda$, such that
$\Gamma(\beta, \lambda^k) \leq \|\beta\|_1 + 1/k$. This inequality implies that some
subsequence of this sequence converges to a $\lambda \in \bar{\Lambda}$. Using the
arithmetic-geometric mean inequality we conclude that $\lambda = |\beta|$ and the result
follows. To prove the second part, observe that if $\Lambda$ is a nonempty convex cone,
namely, for any $\lambda \in \Lambda$ and $t > 0$ it holds that $t\lambda \in \Lambda$, we have
that $\Omega$ is positive homogeneous. Indeed, making the change of variable
$\lambda' = \lambda/|t|$ we see that $\Omega(t\beta|\Lambda) = |t|\,\Omega(\beta|\Lambda)$.
Moreover, the above inequality, $\Omega(\beta|\Lambda) \geq \|\beta\|_1$, implies that if
$\Omega(\beta|\Lambda) = 0$ then $\beta = 0$. The proof of the triangle inequality follows from
the homogeneity and convexity of $\Omega$, namely
$\Omega(\alpha + \beta|\Lambda) = 2\,\Omega((\alpha + \beta)/2\,|\Lambda) \leq \Omega(\alpha|\Lambda) + \Omega(\beta|\Lambda)$.

Finally, note that the smallest constant $\omega$ for which
$\Omega(\beta|\Lambda) \leq \omega\|\beta\|_1$ holds for every $\beta \in \mathbb{R}^n$ is
$\omega = \max\{\Omega(\beta|\Lambda) : \|\beta\|_1 = 1\}$. Since $\Omega$ is convex the
maximum above is achieved at an extreme point of the $\ell_1$ unit ball. ■

This proposition indicates a heuristic interpretation of the method (2.2): among all vectors
$\beta$ which have a fixed value of the $\ell_1$ norm, the penalty function $\Omega$ will
encourage those for which $|\beta| \in \bar{\Lambda}$. Moreover, when
$|\beta| \in \bar{\Lambda}$ the function $\Omega$ reduces to the $\ell_1$ norm and, so, the
solution of problem (2.2) is expected to be sparse. The penalty function therefore will
encourage certain desired sparsity patterns.

The last point can be better understood by looking at problem (2.1). For every solution
$(\hat\beta, \hat\lambda)$, the sparsity pattern of $\hat\beta$ is contained in the sparsity
pattern of $\hat\lambda$, that is, the indices associated with nonzero components of
$\hat\beta$ are a subset of those of $\hat\lambda$. Indeed, if $\hat\lambda_i = 0$ it must hold
that $\hat\beta_i = 0$ as well, since the objective would diverge otherwise (because of the
ratio $\beta_i^2/\lambda_i$). Therefore, if the set $\Lambda$ favors certain sparse solutions
of $\hat\lambda$, the same sparsity pattern will be reflected on $\hat\beta$. Moreover, the
$\sum_{i \in \mathbb{N}_n} \lambda_i$ term appearing in the expression for
$\Gamma(\beta, \lambda)$ favors sparse $\lambda$ vectors. For example, a constraint of the form
$\lambda_1 \geq \cdots \geq \lambda_n$ favors consecutive zeros at the end of $\lambda$ and
nonzeros everywhere else. This will lead to zeros at the terminal components of $\beta$ as
well. Thus, in many cases like this, it is easy to incorporate a convex constraint on
$\lambda$, whereas it may not be possible to do the same with $\beta$.

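Next, we note that a normalized version of the group Lasso penalty [31] is included in our
setting as a special case. If, for some $k \in \mathbb{N}_n$,
$\{J_\ell : \ell \in \mathbb{N}_k\}$ forms a partition of the index set $\mathbb{N}_n$, the
corresponding group Lasso penalty is defined as

$$\Omega_{\rm GL}(\beta) = \sum_{\ell \in \mathbb{N}_k} \sqrt{|J_\ell|}\,\|\beta_{|J_\ell}\|_2, \tag{2.5}$$

where, for every $J \subseteq \mathbb{N}_n$, we use the notation
$\beta_{|J} = (\beta_j : j \in J)$. It is an easy matter to verify that
$\Omega_{\rm GL} = \Omega(\cdot|\Lambda)$ for
$\Lambda = \{\lambda : \lambda \in \mathbb{R}^n_{++},\ \lambda_j = \theta_\ell,\ j \in J_\ell,\ \ell \in \mathbb{N}_k,\ \theta_\ell > 0\}$.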
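
The verification can be illustrated numerically. The sketch below is our own illustration under stated assumptions: the function names, the example vector and the 0-based groups are hypothetical. It relies on the elementary fact that $\frac{1}{2}(\|\beta_{|J}\|_2^2/\theta + |J|\theta)$ is minimized over $\theta > 0$ at $\theta = \|\beta_{|J}\|_2/\sqrt{|J|}$.

```python
import numpy as np

def group_lasso_penalty(beta, groups):
    """Normalized group Lasso penalty: sum_l sqrt(|J_l|) * ||beta_{J_l}||_2."""
    beta = np.asarray(beta, dtype=float)
    return sum(np.sqrt(len(J)) * np.linalg.norm(beta[J]) for J in groups)

def gamma_blockwise(beta, groups, theta):
    """Gamma(beta, lam) with lam_j = theta_l for every j in group J_l."""
    beta = np.asarray(beta, dtype=float)
    lam = np.empty_like(beta)
    for J, t in zip(groups, theta):
        lam[J] = t
    return 0.5 * np.sum(beta ** 2 / lam + lam)

beta = np.array([3.0, -4.0, 1.0, 2.0, -2.0])
groups = [[0, 1], [2, 3, 4]]   # 0-based index sets forming a partition

# Per-group minimizer of 0.5 * (||beta_J||^2 / theta + |J| * theta).
theta_star = [np.linalg.norm(beta[J]) / np.sqrt(len(J)) for J in groups]

closed_form = group_lasso_penalty(beta, groups)
value_at_theta_star = gamma_blockwise(beta, groups, theta_star)
value_perturbed = gamma_blockwise(beta, groups, [1.1 * t for t in theta_star])
```

Here `value_at_theta_star` agrees with `closed_form`, and perturbing the per-group values only increases the objective.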

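The next proposition presents a useful construction which may be employed to generate new
penalty functions from available ones. It is obtained by composing a set
$\Theta \subseteq \mathbb{R}^k_{++}$ with a linear transformation, modeling the sum of the
components of a vector, across the elements of a prescribed partition
$\mathcal{P} = \{P_\ell : \ell \in \mathbb{N}_k\}$ of $\mathbb{N}_n$. To describe our result we
introduce the group average map $A_\mathcal{P} : \mathbb{R}^n \to \mathbb{R}^k$ induced by
$\mathcal{P}$. It is defined, for each $\beta \in \mathbb{R}^n$, as
$A_\mathcal{P}(\beta) = (\|\beta_{|P_\ell}\|_1 : \ell \in \mathbb{N}_k)$.

Proposition 2.3. If $\Theta \subseteq \mathbb{R}^k_{++}$, $\beta \in \mathbb{R}^n$ and
$\mathcal{P}$ is a partition of $\mathbb{N}_n$ then

$$\Omega(\beta|A_\mathcal{P}^{-1}(\Theta)) = \Omega(A_\mathcal{P}(\beta)|\Theta).$$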






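Proof. The idea of the proof depends on two basic observations. The first uses the set
theoretic formula

$$A_\mathcal{P}^{-1}(\Theta) = \bigcup_{\theta \in \Theta} A_\mathcal{P}^{-1}(\theta).$$

From this decomposition we obtain that

$$\Omega(\beta|A_\mathcal{P}^{-1}(\Theta)) = \inf\left\{\inf\left\{\Gamma(\beta, \lambda) : \lambda \in A_\mathcal{P}^{-1}(\theta)\right\} : \theta \in \Theta\right\}. \tag{2.6}$$

Next, we write $\theta = (\theta_\ell : \ell \in \mathbb{N}_k) \in \Theta$ and decompose the
inner infimum as the sum

$$\sum_{\ell \in \mathbb{N}_k} \inf\left\{\frac{1}{2}\sum_{j \in P_\ell}\left(\frac{\beta_j^2}{\lambda_j} + \lambda_j\right) : \sum_{j \in P_\ell} \lambda_j = \theta_\ell,\ \lambda_j > 0,\ j \in P_\ell\right\}.$$

Now, the second essential step in the proof evaluates the infimum in the second sum by the
Cauchy-Schwarz inequality to obtain that

$$\inf\left\{\Gamma(\beta, \lambda) : \lambda \in A_\mathcal{P}^{-1}(\theta)\right\} = \frac{1}{2}\sum_{\ell \in \mathbb{N}_k}\left(\frac{\|\beta_{|P_\ell}\|_1^2}{\theta_\ell} + \theta_\ell\right).$$

We now substitute this formula into the right hand side of equation (2.6) to finish the
proof. ■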
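
The Cauchy-Schwarz step can be checked numerically on a single block. The following is a sketch under our own assumptions: the names `inner_infimum` and `gamma`, the block `b` and the value of `theta` are illustrative. The candidate minimizer $\lambda_j = \theta|\beta_j|/\|\beta\|_1$ is the point at which the Cauchy-Schwarz inequality becomes an equality.

```python
import numpy as np

def inner_infimum(beta_block, theta):
    """Closed form of inf{0.5*sum(b_j^2/lam_j + lam_j) : sum(lam) = theta, lam > 0}."""
    s = np.sum(np.abs(beta_block))
    return 0.5 * (s ** 2 / theta + theta)

def gamma(beta, lam):
    """Gamma(beta, lam) = 0.5 * sum(beta_i**2 / lam_i + lam_i)."""
    beta = np.asarray(beta, dtype=float)
    lam = np.asarray(lam, dtype=float)
    return 0.5 * np.sum(beta ** 2 / lam + lam)

b = np.array([1.0, -2.0, 0.5])   # one block of the regression vector
theta = 2.0                      # prescribed sum of the lam_j in this block
lam_star = theta * np.abs(b) / np.sum(np.abs(b))   # equality case of Cauchy-Schwarz

# Random feasible points: positive vectors rescaled so that they sum to theta.
rng = np.random.default_rng(1)
feasible = rng.dirichlet(np.ones(b.size), size=2000) * theta
values = np.array([gamma(b, lam) for lam in feasible])
```

In this sketch `gamma(b, lam_star)` matches `inner_infimum(b, theta)`, and no sampled feasible point attains a smaller value.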

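When the set $\Lambda$ is a nonempty convex cone, to emphasize that the function
$\Omega(\cdot|\Lambda)$ is a norm we denote it by $\|\cdot\|_\Lambda$. We end this section with
the identification of the dual norm of $\|\cdot\|_\Lambda$, which is defined as

$$\|\beta\|_{*,\Lambda} = \max\left\{\beta^\top u : u \in \mathbb{R}^n,\ \|u\|_\Lambda = 1\right\}.$$

Proposition 2.4. If $\Lambda$ is a nonempty convex cone then there holds the equation

$$\|\beta\|_{*,\Lambda} = \sup\left\{\sqrt{\frac{\sum_{i \in \mathbb{N}_n} \lambda_i\beta_i^2}{\sum_{i \in \mathbb{N}_n} \lambda_i}} : \lambda \in \Lambda\right\}.$$

Proof. By definition, $\varphi = \|\beta\|_{*,\Lambda}$ is the smallest constant $\varphi$
such that, for every $\lambda \in \Lambda$ and $u \in \mathbb{R}^n$, it holds that

$$\frac{\varphi}{2}\sum_{i \in \mathbb{N}_n}\left(\frac{u_i^2}{\lambda_i} + \lambda_i\right) - \beta^\top u \geq 0.$$

Minimizing the left hand side of this inequality for $u \in \mathbb{R}^n$ yields the
equivalent inequality

$$\varphi^2 \geq \frac{\sum_{i \in \mathbb{N}_n} \lambda_i\beta_i^2}{\sum_{i \in \mathbb{N}_n} \lambda_i}.$$

Since this inequality holds for every $\lambda \in \Lambda$, the result follows by taking the
supremum of the right hand side of the above inequality over this set. ■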
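
As a sanity check of the dual norm formula, consider the Lasso cone $\Lambda = \mathbb{R}^n_{++}$, for which $\|\cdot\|_\Lambda$ is the $\ell_1$ norm and the dual norm should therefore be the $\ell_\infty$ norm. The sketch below is our own illustration: we read the proposition as the supremum of $\sqrt{\sum_i \lambda_i\beta_i^2 / \sum_i \lambda_i}$ over $\lambda \in \Lambda$, and the name `dual_ratio` and the test vector are hypothetical.

```python
import numpy as np

def dual_ratio(beta, lam):
    """sqrt(sum(lam_i * beta_i^2) / sum(lam_i)), the quantity whose supremum is taken."""
    beta = np.asarray(beta, dtype=float)
    lam = np.asarray(lam, dtype=float)
    return np.sqrt(np.sum(lam * beta ** 2) / np.sum(lam))

beta = np.array([0.5, -2.0, 1.0])
linf = np.max(np.abs(beta))

# Random positive lam never exceed the l-infinity norm ...
rng = np.random.default_rng(2)
samples = [dual_ratio(beta, rng.uniform(1e-3, 1.0, size=3)) for _ in range(1000)]

# ... while concentrating lam on the largest |beta_i| approaches it.
lam_peaked = np.array([1e-9, 1.0, 1e-9])
near_sup = dual_ratio(beta, lam_peaked)
```

The supremum is not attained inside the open orthant, which is why the sketch only approaches `linf` with a peaked $\lambda$.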
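The formula for the dual norm suggests that we introduce the set
$\tilde\Lambda = \{\lambda : \lambda \in \Lambda,\ \sum_{i \in \mathbb{N}_n} \lambda_i = 1\}$.
With this notation we see that the dual norm becomes

$$\|\beta\|_{*,\Lambda} = \sup\left\{\sqrt{\sum_{i \in \mathbb{N}_n} \lambda_i\beta_i^2} : \lambda \in \tilde\Lambda\right\}.$$

Moreover, a direct computation yields an alternate form for the original norm given by the
equation

$$\|\beta\|_\Lambda = \inf\left\{\sqrt{\sum_{i \in \mathbb{N}_n} \frac{\beta_i^2}{\lambda_i}} : \lambda \in \tilde\Lambda\right\}.$$
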
3 Box penalty 

We proceed to discuss some examples of the set $\Lambda \subseteq \mathbb{R}^n_{++}$, which may
be used in the design of the penalty function $\Omega(\cdot|\Lambda)$.

The first example, which is presented in this section, corresponds to the prior knowledge
that the magnitude of the components of the regression vector should be in some prescribed
intervals. We choose $a = (a_i : i \in \mathbb{N}_n)$,
$b = (b_i : i \in \mathbb{N}_n) \in \mathbb{R}^n$, $0 < a_i \leq b_i$, and define the
corresponding box as
$B[a, b] := \{(\lambda_i : i \in \mathbb{N}_n) : \lambda_i \in [a_i, b_i],\ i \in \mathbb{N}_n\}$.
The theorem below establishes the form of the box penalty. To state our result, we define,
for every $t \in \mathbb{R}$, the function $t_+ = \max(0, t)$.

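Theorem 3.1. We have that

$$\Omega(\beta|B[a, b]) = \|\beta\|_1 + \sum_{i \in \mathbb{N}_n}\left(\frac{1}{2a_i}(a_i - |\beta_i|)_+^2 + \frac{1}{2b_i}(|\beta_i| - b_i)_+^2\right).$$

Moreover, the components of the vector
$\lambda(\beta) := \operatorname{argmin}\{\Gamma(\beta, \lambda) : \lambda \in B[a, b]\}$ are
given by the equations
$\lambda_i(\beta) = |\beta_i| + (a_i - |\beta_i|)_+ - (|\beta_i| - b_i)_+$, $i \in \mathbb{N}_n$.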

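Proof. Since
$\Omega(\beta|B[a, b]) = \sum_{i \in \mathbb{N}_n} \Omega(\beta_i|[a_i, b_i])$ it suffices to
establish the result in the case $n = 1$. We shall show that if
$a, b, \beta \in \mathbb{R}$, $0 < a \leq b$ then

$$\Omega(\beta|[a, b]) = |\beta| + \frac{1}{2a}(a - |\beta|)_+^2 + \frac{1}{2b}(|\beta| - b)_+^2. \tag{3.1}$$

Since both sides of the above equation are continuous functions of $\beta$ it suffices to
prove this equation for $\beta \in \mathbb{R}\setminus\{0\}$. In this case, the function
$\Gamma(\beta, \cdot)$ is strictly convex, and so, has a unique minimum in $\mathbb{R}_{++}$ at
$\lambda = |\beta|$, see also Figure 1-b. Moreover, if $|\beta| < a$ the minimum over $[a, b]$
occurs at $\lambda = a$, whereas if $|\beta| > b$, it occurs at $\lambda = b$. This establishes
the formula for $\lambda(\beta)$. Consequently, we have that

$$\Omega(\beta|[a, b]) = \begin{cases} |\beta|, & \text{if } |\beta| \in [a, b] \\ \frac{1}{2}\left(\frac{\beta^2}{a} + a\right), & \text{if } |\beta| < a \\ \frac{1}{2}\left(\frac{\beta^2}{b} + b\right), & \text{if } |\beta| > b. \end{cases}$$

Equation (3.1) now follows by a direct computation. ■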
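
The closed form of the box penalty lends itself to a direct implementation. The following is a minimal sketch, assuming componentwise bounds with $0 < a_i \leq b_i$; the function names and the numerical values are our own illustrative choices. It compares the closed form with $\Gamma$ evaluated at the clipped minimizer.

```python
import numpy as np

def box_penalty(beta, a, b):
    """Closed form of the box penalty Omega(beta | B[a, b]), with 0 < a_i <= b_i."""
    absb = np.abs(np.asarray(beta, dtype=float))
    low = np.maximum(a - absb, 0.0)     # (a_i - |beta_i|)_+
    high = np.maximum(absb - b, 0.0)    # (|beta_i| - b_i)_+
    return np.sum(absb + low ** 2 / (2 * a) + high ** 2 / (2 * b))

def box_minimizer(beta, a, b):
    """lambda(beta): |beta_i| clipped to the interval [a_i, b_i]."""
    return np.clip(np.abs(np.asarray(beta, dtype=float)), a, b)

def gamma(beta, lam):
    """Gamma(beta, lam) = 0.5 * sum(beta_i**2 / lam_i + lam_i)."""
    return 0.5 * np.sum(np.asarray(beta, dtype=float) ** 2 / lam + lam)

beta = np.array([0.2, -1.0, 3.0])   # below, inside and above the interval
a = np.array([0.5, 0.5, 0.5])
b = np.array([2.0, 2.0, 2.0])

lam = box_minimizer(beta, a, b)
penalty = box_penalty(beta, a, b)
direct = gamma(beta, lam)
```

Clipping $|\beta_i|$ to $[a_i, b_i]$ is the same operation as $|\beta_i| + (a_i - |\beta_i|)_+ - (|\beta_i| - b_i)_+$, so `penalty` and `direct` coincide.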

[Figure 2: example of a contiguous partition, with $J_1 = \{1\}$, $J_2 = \{2, 3, 4, 5\}$,
$J_3 = \{6, 7\}$.]

We also refer to [12, 24] for related penalty functions. Note that the function in equation
(3.1) is a concatenation of two quadratic functions, connected together with a linear
function. Thus, the box penalty will favor sparsity only for $a = 0$, a case that is defined
by a limiting argument.



4 Wedge penalty 

In this section, we consider the case that the coordinates of the vector $\lambda \in \Lambda$
are ordered in a nonincreasing fashion. As we shall see, the corresponding penalty function
favors regression vectors which are likewise nonincreasing.
We define the wedge

$$W = \left\{\lambda : \lambda = (\lambda_i : i \in \mathbb{N}_n) \in \mathbb{R}^n_{++},\ \lambda_i \geq \lambda_{i+1},\ i \in \mathbb{N}_{n-1}\right\}.$$

Our next result describes the form of the penalty $\Omega$ in this case. To explain this
result we require some preparation. We say that a partition
$\mathcal{J} = \{J_\ell : \ell \in \mathbb{N}_k\}$ of $\mathbb{N}_n$ is contiguous if for all
$i \in J_\ell$, $j \in J_{\ell+1}$, $\ell \in \mathbb{N}_{k-1}$, it holds that $i < j$. For
example, if $n = 3$, partitions $\{\{1, 2\}, \{3\}\}$ and $\{\{1\}, \{2\}, \{3\}\}$ are
contiguous but $\{\{1, 3\}, \{2\}\}$ is not.

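Definition 4.1. Given any two disjoint subsets $J, K \subseteq \mathbb{N}_n$ we define the
region in $\mathbb{R}^n$

$$Q_{J,K} = \left\{\beta : \beta \in \mathbb{R}^n,\ \frac{\|\beta_{|J}\|_2^2}{|J|} > \frac{\|\beta_{|K}\|_2^2}{|K|}\right\}. \tag{4.1}$$

Note that the boundary of this region is determined by the zero set of a homogeneous
polynomial of degree two. We also need the following construction.

Definition 4.2. For every $S \subseteq \mathbb{N}_{n-1}$ we set $k = |S| + 1$ and label the
elements of $S$ in increasing order as $S = \{j_\ell : \ell \in \mathbb{N}_{k-1}\}$. We
associate with the set $S$ a contiguous partition of $\mathbb{N}_n$, given by
$\mathcal{J}(S) = \{J_\ell : \ell \in \mathbb{N}_k\}$, where we define
$J_\ell := [j_{\ell-1} + 1, j_\ell] \cap \mathbb{N}_n$, $\ell \in \mathbb{N}_k$, and set
$j_0 = 0$ and $j_k = n$.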

Figure 2 illustrates an example of a contiguous partition along with the set
$\mathcal{J}(S)$.

A subset $S$ of $\mathbb{N}_{n-1}$ also induces two regions in $\mathbb{R}^n$ which play a
central role in the identification of the wedge penalty. First, we describe the region which
"crosses over" the induced partition $\mathcal{J}(S)$. This is defined to be the set

$$O_S := \bigcap\left\{Q_{J_\ell, J_{\ell+1}} : \ell \in \mathbb{N}_{k-1}\right\}. \tag{4.2}$$

In other words, $\beta \in O_S$ if the average of the square of its components within each
region $J_\ell$ strictly decreases with $\ell$. The next region which is essential in our
analysis is the "stays within" region, induced by the partition $\mathcal{J}(S)$. This region
is defined as

$$I_S := \bigcap\left\{\bar{Q}_{J_\ell, J_{\ell,q}} : q \in J_\ell,\ \ell \in \mathbb{N}_k\right\} \tag{4.3}$$

where $\bar{Q}$ denotes the closure of the set $Q$ and we use the notation
$J_{\ell,q} := \{j : j \in J_\ell,\ j \leq q\}$. In other words, all vectors $\beta$ within
this region have the property that, for every set $J_\ell \in \mathcal{J}(S)$, the average of
the square of a first segment of components of $\beta$ within this set is not greater than
the average over $J_\ell$. We note that if $S$ is the empty set the above notation should be
interpreted as $O_S = \mathbb{R}^n$ and

$$I_S = \bigcap\left\{\bar{Q}_{\mathbb{N}_n, \mathbb{N}_q} : q \in \mathbb{N}_n\right\}.$$

From the cross-over and stay-within sets we define the region

$$P_S := O_S \cap I_S.$$

Alternatively, we shall describe below the set $P_S$ in terms of two vectors induced by a
vector $\beta \in \mathbb{R}^n$ and the set $S \subseteq \mathbb{N}_{n-1}$. These vectors play
the role of the Lagrange multiplier and the minimizer $\lambda$ for the wedge penalty in the
theorem below.

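Definition 4.3. For every vector $\beta \in (\mathbb{R}\setminus\{0\})^n$ and every subset
$S \subseteq \mathbb{N}_{n-1}$ we let $\mathcal{J}(S)$ be the induced contiguous partition of
$\mathbb{N}_n$ and define two vectors $\zeta(\beta, S) \in \mathbb{R}^{n+1}$ and
$\delta(\beta, S) \in \mathbb{R}^n_{++}$ by

$$\zeta_q(\beta, S) = \begin{cases} 0, & \text{if } q \in S \cup \{0, n\}, \\ |J_{\ell,q}| - |J_\ell|\,\dfrac{\|\beta_{|J_{\ell,q}}\|_2^2}{\|\beta_{|J_\ell}\|_2^2}, & \text{if } q \in J_\ell,\ \ell \in \mathbb{N}_k \end{cases}$$

and

$$\delta_q(\beta, S) = \frac{\|\beta_{|J_\ell}\|_2}{\sqrt{|J_\ell|}}, \quad q \in J_\ell,\ \ell \in \mathbb{N}_k. \tag{4.4}$$

Note that the components of $\delta(\beta, S)$ are constant on each set $J_\ell$,
$\ell \in \mathbb{N}_k$.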
Lemma 4.1. For every $\beta \in (\mathbb{R}\setminus\{0\})^n$ and
$S \subseteq \mathbb{N}_{n-1}$ we have that

(a) $\beta \in P_S$ if and only if $\zeta(\beta, S) \geq 0$ and
$\delta(\beta, S) \in \operatorname{int}(W)$;

(b) If $\delta(\beta, S_1) = \delta(\beta, S_2)$ and $\beta \in O_{S_1} \cap O_{S_2}$ then
$S_1 = S_2$.

Proof. The first assertion follows directly from the definition of the requisite quantities.
The proof of the second assertion is a direct consequence of the fact that the vector
$\delta(\beta, S)$ is constant on any element of the partition $\mathcal{J}(S)$ and strictly
decreasing from one element to the next in that partition. ■

For the theorem below we introduce, for every $S \subseteq \mathbb{N}_{n-1}$, the sets

$$U_S := P_S \cap (\mathbb{R}\setminus\{0\})^n.$$

We shall establish not only that the collection of sets
$\mathcal{U} := \{U_S : S \subseteq \mathbb{N}_{n-1}\}$ forms a partition of
$(\mathbb{R}\setminus\{0\})^n$, that is, their union is $(\mathbb{R}\setminus\{0\})^n$ and two
distinct elements of $\mathcal{U}$ are disjoint, but also explicitly determine the wedge
penalty on each element of $\mathcal{U}$.

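Theorem 4.1. The collection of sets $\mathcal{U} := \{U_S : S \subseteq \mathbb{N}_{n-1}\}$
forms a partition of $(\mathbb{R}\setminus\{0\})^n$. For each
$\beta \in (\mathbb{R}\setminus\{0\})^n$ there is a unique $S \subseteq \mathbb{N}_{n-1}$ such
that $\beta \in U_S$, and

$$\Omega(\beta|W) = \sum_{\ell \in \mathbb{N}_k} \sqrt{|J_\ell|}\,\|\beta_{|J_\ell}\|_2, \tag{4.5}$$

where $k = |S| + 1$. Moreover, the components of the vector
$\lambda(\beta) := \operatorname{argmin}\{\Gamma(\beta, \lambda) : \lambda \in W\}$ are given by
the equations $\lambda_j(\beta) = \mu_\ell$, $j \in J_\ell$, $\ell \in \mathbb{N}_k$, where

$$\mu_\ell = \frac{\|\beta_{|J_\ell}\|_2}{\sqrt{|J_\ell|}}. \tag{4.6}$$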



Proof. First, let us observe that there are $n - 1$ inequality constraints defining $W$. It
readily follows that all vectors in this constraint set are regular, in the sense of
optimization theory, see [4, p. 279]. Hence, we can appeal to [4, Prop. 3.3.4, p. 316 and
Prop. 3.3.6, p. 322], which state that $\lambda \in W$ is a solution to the minimum problem
determined by the wedge penalty, if and only if there exists a vector
$\alpha = (\alpha_i : i \in \mathbb{N}_{n-1})$ with nonnegative components such that

$$-\frac{\beta_j^2}{\lambda_j^2} + 1 + \alpha_{j-1} - \alpha_j = 0, \quad j \in \mathbb{N}_n, \tag{4.7}$$

where we set $\alpha_0 = \alpha_n = 0$. Furthermore, the following complementary slackness
conditions hold true

$$\alpha_j(\lambda_{j+1} - \lambda_j) = 0, \quad j \in \mathbb{N}_{n-1}. \tag{4.8}$$

To unravel these equations, we let
$S := \{j : \lambda_j > \lambda_{j+1},\ j \in \mathbb{N}_{n-1}\}$, which is the subset of
indexes corresponding to the constraints that are not tight. When $k \geq 2$, we express this
set in the form $\{j_\ell : \ell \in \mathbb{N}_{k-1}\}$ where $k = |S| + 1$. As explained in
Definition 4.2, the set $S$ induces the partition
$\mathcal{J}(S) = \{J_\ell : \ell \in \mathbb{N}_k\}$ of $\mathbb{N}_n$. When $k = 1$ our
notation should be interpreted to mean that $S$ is empty and the partition $\mathcal{J}(S)$
consists only of $\mathbb{N}_n$. In this case, it is easy to solve equations (4.7) and (4.8).
In fact, all components of the vector $\lambda$ have a common value, say $\mu > 0$, and by
summing both sides of equation (4.7) over $j \in \mathbb{N}_n$ we obtain that

$$\mu^2 = \frac{\|\beta\|_2^2}{n}.$$

Moreover, summing both sides of the same equation over $j \in \mathbb{N}_q$ we obtain that

$$\alpha_q = -\frac{\|\beta_{|\mathbb{N}_q}\|_2^2}{\mu^2} + q$$

and, since $\alpha_q \geq 0$, we conclude that $\beta \in I_S = P_S$.

We now consider the case that $k \geq 2$. Hence, the vector $\lambda$ has equal components on
each subset $J_\ell$, which we denote by $\mu_\ell$, $\ell \in \mathbb{N}_k$. The definition of
the set $S$ implies that the sequence $\{\mu_\ell : \ell \in \mathbb{N}_k\}$ is strictly
decreasing and equation (4.8) implies that $\alpha_j = 0$, for every $j \in S$. Summing both
sides of equation (4.7) over $j \in J_\ell$ we obtain that

$$-\frac{1}{\mu_\ell^2}\sum_{j \in J_\ell} \beta_j^2 + |J_\ell| = 0 \tag{4.9}$$

from which equation (4.6) follows. Since the $\mu_\ell$ are strictly decreasing, we conclude
that $\beta \in O_S$. Moreover, choosing $q \in J_\ell$ and summing both sides of equation
(4.7) over $j \in J_{\ell,q}$ we obtain that

$$0 \leq \alpha_q = -\frac{\|\beta_{|J_{\ell,q}}\|_2^2}{\mu_\ell^2} + |J_{\ell,q}|$$

which implies that $\beta \in \bar{Q}_{J_\ell, J_{\ell,q}}$. Since this holds for every
$q \in J_\ell$ and $\ell \in \mathbb{N}_k$ we conclude that $\beta \in I_S$ and therefore, it
follows that $\beta \in U_S$.
In summary, we have shown that $\alpha = \zeta(\beta, S)$, $\lambda = \delta(\beta, S)$, and
$\beta \in U_S$. In particular, this implies that the collection of sets $\mathcal{U}$ covers
$(\mathbb{R}\setminus\{0\})^n$. Next, we show that the elements of $\mathcal{U}$ are disjoint.
To this end, we observe that the computation described above can be reversed. That is to say,
conversely for any $S \subseteq \mathbb{N}_{n-1}$ and $\beta \in U_S$ we conclude that
$\delta(\beta, S)$ and $\zeta(\beta, S)$ solve the equations (4.7) and (4.8). Since the wedge
penalty function is strictly convex we know that equations (4.7) and (4.8) have a unique
solution. Now, if $\beta \in U_{S_1} \cap U_{S_2}$ then it must follow that
$\delta(\beta, S_1) = \delta(\beta, S_2)$. Consequently, by part (b) in Lemma 4.1 we conclude
that $S_1 = S_2$. ■

Note that the set $S$ and the associated partition $\mathcal{J}$ appearing in the theorem are
identified by examining the optimality conditions of the optimization problem (2.3) for
$\Lambda = W$. There are $2^{n-1}$ possible partitions. Thus, for a given
$\beta \in (\mathbb{R}\setminus\{0\})^n$, determining the corresponding partition is a
challenging problem. We explain how to do this in Section 7.
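
To make the structure of the wedge penalty concrete, the following sketch computes it for a vector with nonzero entries. It is our own illustration and not the algorithm of Section 7: we assume that the partition can be obtained by a pool-adjacent-violators pass on the block averages of $\beta_i^2$, merging adjacent blocks until these averages are strictly decreasing, which yields blocks satisfying the cross-over and stay-within conditions. The function name `wedge_penalty` is hypothetical.

```python
import numpy as np

def wedge_penalty(beta):
    """Return (Omega(beta | W), lam, block sizes) for beta with nonzero entries.

    Adjacent indices are pooled whenever the block averages of beta_i^2 fail
    to be strictly decreasing (a pool-adjacent-violators pass).
    """
    sq = np.asarray(beta, dtype=float) ** 2
    blocks = []                                   # each block: [sum of squares, size]
    for s in sq:
        blocks.append([s, 1])
        # Merge while the previous block's average does not exceed the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] <= blocks[-1][0] * blocks[-2][1]):
            s_last, n_last = blocks.pop()
            blocks[-1][0] += s_last
            blocks[-1][1] += n_last
    sizes = [n for _, n in blocks]
    mus = [np.sqrt(s / n) for s, n in blocks]     # common value of lam on each block
    lam = np.repeat(mus, sizes)
    omega = sum(np.sqrt(n * s) for s, n in blocks)  # sqrt(|J|) * ||beta_J||_2 per block
    return omega, lam, sizes

omega_a, lam_a, sizes_a = wedge_penalty([2.0, 1.0])       # already decreasing
omega_b, lam_b, sizes_b = wedge_penalty([1.0, 2.0])       # pooled into one block
omega_c, lam_c, sizes_c = wedge_penalty([1.0, 3.0, 2.0])  # blocks {1, 2} and {3}
```

For the decreasing vector the penalty reduces to the $\ell_1$ norm, while for the increasing one it equals $\sqrt{2}\,\|\beta\|_2$, in agreement with the group Lasso form of the wedge penalty.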

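An interesting property of the wedge penalty, which is indicated by Theorem 4.1, is that it
has the form of a group Lasso penalty as in equation (2.5), with groups not fixed a priori but
depending on the location of the vector $\beta$. The groups are the elements of the partition
$\mathcal{J}$ and are identified by certain convex constraints on the vector $\beta$. For
example, for $n = 2$ we obtain that $\Omega(\beta|W) = \|\beta\|_1$ if
$|\beta_1| > |\beta_2|$ and $\Omega(\beta|W) = \sqrt{2}\,\|\beta\|_2$ otherwise. For $n = 3$,
we have that

$$\Omega(\beta|W) = \begin{cases} \|\beta\|_1, & \text{if } |\beta_1| > |\beta_2| > |\beta_3| & \mathcal{J} = \{\{1\}, \{2\}, \{3\}\} \\ \sqrt{2(\beta_1^2 + \beta_2^2)} + |\beta_3|, & \text{if } |\beta_1| \leq |\beta_2| \text{ and } \beta_1^2 + \beta_2^2 > 2\beta_3^2 & \mathcal{J} = \{\{1, 2\}, \{3\}\} \\ |\beta_1| + \sqrt{2(\beta_2^2 + \beta_3^2)}, & \text{if } |\beta_2| \leq |\beta_3| \text{ and } 2\beta_1^2 > \beta_2^2 + \beta_3^2 & \mathcal{J} = \{\{1\}, \{2, 3\}\} \\ \sqrt{3(\beta_1^2 + \beta_2^2 + \beta_3^2)}, & \text{otherwise} & \mathcal{J} = \{\{1, 2, 3\}\} \end{cases}$$
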



where we have also displayed the partition $\mathcal{J}$ involved in each case. We also
present a graphical representation of the corresponding unit ball in Figure 3-a. For
comparison we also graphically display the unit ball for the hierarchical group Lasso with
groups $\{1, 2, 3\}$, $\{2, 3\}$, $\{3\}$ and two group Lasso in Figure 3-b,c,d, respectively.

Figure 3: Unit ball of different penalty functions: (a) Wedge penalty $\Omega(\cdot|W)$;
(b) hierarchical group Lasso; (c) group Lasso with groups $\{\{1, 2\}, \{3\}\}$; (d) group
Lasso with groups $\{\{1\}, \{2, 3\}\}$; (e) the penalty $\Omega(\cdot|W^2)$.

The wedge may equivalently be expressed as the constraint that the difference vector
$D^1(\lambda) := (\lambda_{j+1} - \lambda_j : j \in \mathbb{N}_{n-1})$ is less than or equal to
zero. This alternative interpretation suggests the $k$-th order difference operator, which is
given by the formula

$$D^k(\lambda) := \left(\lambda_{j+k} + \sum_{\ell \in \mathbb{N}_k} (-1)^\ell \binom{k}{\ell} \lambda_{j+k-\ell} : j \in \mathbb{N}_{n-k}\right)$$

and the corresponding $k$-th wedge

$$W^k := \left\{\lambda : \lambda \in \mathbb{R}^n_{++},\ D^k(\lambda) \geq 0\right\}. \tag{4.10}$$

The associated penalty $\Omega(\cdot|W^k)$ encourages vectors whose sparsity pattern is
concentrated on at most $k$ different contiguous regions. Note that $W^1$ is not the wedge $W$
considered earlier. Moreover, the 2-wedge includes vectors which have a convex "profile" and
whose sparsity pattern is concentrated either on the first elements of the vector, on the
last, or on both.



5 Graph penalty

In this section we present an extension of the wedge set which is inspired by previous work on
the group Lasso estimator with hierarchically overlapping groups [32]. It models vectors whose
magnitude is ordered according to a graphical structure.

Let $G = (V, E)$ be a directed graph, where $V$ is the set of $n$ vertices in the graph and
$E \subseteq V \times V$ is the edge set, whose cardinality is denoted by $m$. If
$(v, w) \in E$ we say that there is a directed edge from vertex $v$ to vertex $w$. The graph is
identified by the $m \times n$ incidence matrix, which we define as

$$A_{ev} = \begin{cases} 1, & \text{if } e = (v, w) \in E,\ w \in V, \\ -1, & \text{if } e = (w, v) \in E,\ w \in V, \\ 0, & \text{otherwise.} \end{cases}$$

We consider the penalty $\|\cdot\|_{\Lambda_G}$ for the convex cone
$\Lambda_G = \{\lambda : \lambda \in \mathbb{R}^n_{++},\ A\lambda \geq 0\}$ and assume, from
now on, that $G$ is acyclic (DAG), that is, $G$ has no directed loops. In particular, this
implies that, if $(v, w) \in E$ then $(w, v) \notin E$. The wedge penalty described above is a
special case of the graph penalty corresponding to a line graph. Let us now discuss some
aspects of the graph penalty for an arbitrary DAG. As we shall see, our remarks lead to an
explicit form of the graph penalty when $G$ is a tree.

If $(v, w) \in E$ we say that vertex $w$ is a child of vertex $v$ and $v$ is a parent of $w$.
For every vertex $v \in V$, we let $C(v)$ and $P(v)$ be the set of children and parents of
$v$, respectively. When $G$ is a tree, $P(v)$ is the empty set if $v$ is the root node and
otherwise $P(v)$ consists of only one element, the parent of $v$, which we denote by $p(v)$.

Let $D(v)$ be the set of descendants of $v$, that is, the set of vertices which are connected
to $v$ by a directed path starting in $v$, and let $A(v)$ be the set of ancestors of $v$, that
is, the set of vertices from which a directed path leads to $v$. We use the convention that
$v \in D(v)$ and $v \notin A(v)$.

Every connected subset V C V induces a subgraph of G which is also a DAG. If V% and V 2 
are disjoint connected subsets of V, we say that they are connected if there is at least one edge 
connecting a pair of vertices in V\ and V 2 , in either one or the other direction. Moreover, we say 
that V% is below V\ — written V 2 JJ- V\ — if V\ and V 2 are connected and every edge connecting 
them departs from a node of V\ . 

Definition 5.1. Let $G$ be a DAG. We say that $C \subseteq E$ is a cut of $G$ if it induces a partition
$\mathcal{V}(C) = \{V_\ell : \ell \in \mathbb{N}_k\}$ of the vertex set $V$ such that $(v,w) \in C$ if and only if vertices $v$ and $w$
belong to two different elements of the partition.

In other words, a cut separates a connected graph into two or more connected components
such that the endpoints of every removed edge, that is, of every element of $C$,
lie in two different components. We also denote by $\mathcal{C}(G)$ the set of cuts of $G$, and by $D_\ell(v)$ the
set of descendants of $v$ within the set $V_\ell$, for every $v \in V_\ell$ and $\ell \in \mathbb{N}_k$.

Next, for every $C \in \mathcal{C}(G)$, we define the regions in $\mathbb{R}^n$ by the equations

$$O_C = \bigcap \{Q_{V_1,V_2} : V_1, V_2 \in \mathcal{V}(C),\ V_2 \Downarrow V_1\} \tag{5.1}$$

and

$$I_C = \bigcap \{\overline{Q}_{D_\ell(v),V_\ell} : \ell \in \mathbb{N}_k,\ v \in V_\ell\}. \tag{5.2}$$

These sets are the graph equivalent of the sets defined by equations (4.2) and (4.3) in the special
case of the wedge penalty in Section 4. We also set $P_C = O_C \cap I_C$.
Moreover, for every $C \in \mathcal{C}(G)$, we define the sets

$$U_C := P_C \cap (\mathbb{R}\setminus\{0\})^n.$$

As of yet, we cannot extend Theorem 4.1 to the case of an arbitrary DAG. However, we can
accomplish this when $G$ is a tree.

Lemma 5.1. Let $G = (V,E)$ be a tree, let $A$ be the associated incidence matrix and let $z =
(z_v : v \in V) \in \mathbb{R}^n$. The following facts are equivalent:

(a) For every $v \in V$ it holds that

$$\sum_{u \in D(v)} z_u \ge 0.$$

(b) The linear system $A^\top \alpha = -z$ admits a non-negative solution for $\alpha = (\alpha_e : e \in E) \in \mathbb{R}^m$.

Proof. The incidence matrix of a tree has the property that, for every $v \in V$ and $e \in E$,

$$\sum_{u \in D(v)} A_{e,u} = -\delta_{e,(p(v),v)} \tag{5.3}$$

where, for every $e, e' \in E$, $\delta_{e,e'} = 1$ if $e = e'$ and zero otherwise. The linear system in (b) can
be written componentwise as

$$\sum_{e \in E} A_{e,u}\,\alpha_e = -z_u, \qquad u \in V.$$

Summing both sides of this equation over $u \in D(v)$ and using equation (5.3), we obtain the
equivalent equations

$$\alpha_{(p(v),v)} = \sum_{u \in D(v)} z_u.$$

The result follows. ■
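
The identity (5.3) and the passage from the linear system in (b) to sums over descendants can be checked on a small tree. The following Python sketch is illustrative only: the function names, the 0-based vertex labels and the example tree are assumptions made here, not part of the paper. It also assumes that the components of $z$ sum to zero over the whole tree, which is what makes the equation at the root (which has no parent edge) hold.

```python
def incidence_matrix(n, edges):
    """Return the m x n incidence matrix: +1 where an edge leaves a vertex, -1 where it enters."""
    A = [[0] * n for _ in edges]
    for e, (v, w) in enumerate(edges):
        A[e][v] = 1
        A[e][w] = -1
    return A


def descendants(n, edges):
    """D(v) for every vertex v of a tree, with the convention that v belongs to D(v)."""
    children = {v: [] for v in range(n)}
    for v, w in edges:
        children[v].append(w)

    def collect(v):
        found = {v}
        for w in children[v]:
            found |= collect(w)
        return found

    return [collect(v) for v in range(n)]


def solve_alpha(n, edges, z):
    """For each edge (p(w), w), return the sum of z_u over the descendants u of w."""
    D = descendants(n, edges)
    return [sum(z[u] for u in D[w]) for (_, w) in edges]


# Hypothetical example: root 0 with children 1 and 2, and vertex 3 below vertex 1.
edges = [(0, 1), (0, 2), (1, 3)]
z = [-3.0, 1.0, 1.0, 1.0]  # sums to zero over the whole tree
A = incidence_matrix(4, edges)
alpha = solve_alpha(4, edges, z)
# Residual of the linear system A^T alpha = -z, one entry per vertex.
residual = [sum(A[e][u] * alpha[e] for e in range(len(edges))) + z[u] for u in range(4)]
```

In this example every subtree sum is non-negative, so the computed `alpha` is a non-negative solution, in line with the equivalence stated in the lemma.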

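Definition 5.2. Let $G = (V,E)$ be a DAG. For every vector $\beta \in (\mathbb{R}\setminus\{0\})^n$ and every cut
$C \in \mathcal{C}(G)$ we let $\mathcal{V}(C) = \{V_\ell : \ell \in \mathbb{N}_k\}$, $k \in \mathbb{N}_n$, be the partition of $V$ induced by $C$, and define
two vectors $\zeta(\beta,C) \in \mathbb{R}^{n-1}$ and $\delta(\beta,C) \in \mathbb{R}^n$. The components of $\zeta(\beta,C)$ are given as

$$\zeta_e(\beta,C) = \begin{cases} 0, & \text{if } e \in C, \\[4pt] \dfrac{|V_\ell|\,\|\beta_{D_\ell(v)}\|_2^2}{\|\beta_{V_\ell}\|_2^2} - |D_\ell(v)|, & \text{if } e = (u,v),\ u \in V_\ell,\ v \in D_\ell(u),\ \ell \in \mathbb{N}_k, \end{cases}$$

whereas the components of $\delta(\beta,C)$ are given by

$$\delta_v(\beta,C) = \frac{\|\beta_{V_\ell}\|_2}{\sqrt{|V_\ell|}}, \qquad v \in V_\ell,\ \ell \in \mathbb{N}_k. \tag{5.4}$$

Note that the notation we adopt in this definition differs from that used in the case of the line
graph, given in Definition 4.3. However, Definition 5.2 leads to a more appropriate presentation
of our results for a tree.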

Proposition 5.1. Let $G = (V,E)$ be a tree and $A$ the associated incidence matrix. For every
$\beta \in (\mathbb{R}\setminus\{0\})^n$ and every cut $C \in \mathcal{C}(G)$ we have that

(a) $\beta \in P_C$ if and only if $\zeta(\beta,C) \ge 0$, $A\delta(\beta,C) \ge 0$ and $\delta_v(\beta,C) > \delta_w(\beta,C)$, for all
$v \in V_1$, $w \in V_2$, $(v,w) \in E$, $V_1, V_2 \in \mathcal{V}(C)$;

(b) If $\delta(\beta,C_1) = \delta(\beta,C_2)$ and $\beta \in P_{C_1} \cap P_{C_2}$ then $C_1 = C_2$.

Proof. We immediately see that $\beta \in O_C$ if and only if $A\delta(\beta,C) \ge 0$ and $\delta_v(\beta,C) > \delta_w(\beta,C)$
for all $v \in V_1$, $w \in V_2$, $(v,w) \in E$, $V_1, V_2 \in \mathcal{V}(C)$. Moreover, by applying Lemma 5.1 on
each element $V_\ell$ of the partition induced by $C$ and choosing $z = (|V_\ell|\,\beta_v^2/\|\beta_{V_\ell}\|_2^2 - 1 : v \in V_\ell)$, we
conclude that $\zeta(\beta,C) \ge 0$ if and only if $\beta \in I_C$. This proves the first assertion.

The proof of the second assertion is a direct consequence of the fact that the vector $\delta(\beta,C)$
is constant on any element of the partition $\mathcal{V}(C)$ and strictly decreasing from one element to
the next in that partition. ■

Theorem 5.1. Let $G = (V,E)$ be a tree. The collection of sets $\mathcal{U} := \{U_C : C \in \mathcal{C}(G)\}$ forms
a partition of $(\mathbb{R}\setminus\{0\})^n$. Moreover, for every $\beta \in (\mathbb{R}\setminus\{0\})^n$ there is a unique $C \in \mathcal{C}(G)$ such
that

$$\|\beta\|_{\Lambda_G} = \sum_{V_\ell \in \mathcal{V}(C)} \sqrt{|V_\ell|}\,\|\beta_{V_\ell}\|_2 \tag{5.5}$$

and the vector $\lambda(\beta) = (\lambda_v(\beta) : v \in V)$ has components given by $\lambda_v(\beta) = \mu_\ell$, $v \in V_\ell$, $\ell \in \mathbb{N}_k$,
where

$$\mu_\ell = \frac{\|\beta_{V_\ell}\|_2}{\sqrt{|V_\ell|}}.$$

Proof. The proof of this theorem proceeds in a fashion similar to that of Theorem 4.1. In this
regard, Lemma 5.1 is crucial. By KKT theory (see e.g. [0, Theorems 3.3.4, 3.3.7]), $\lambda$ is an
optimal solution of the graph penalty if and only if there exists $\alpha \ge 0$ such that, for every $v \in V$,

$$-\frac{\beta_v^2}{\lambda_v^2} + 1 - \sum_{e \in E} \alpha_e A_{e,v} = 0$$

and the following complementary conditions hold true

$$\alpha_{(v,w)}(\lambda_w - \lambda_v) = 0, \qquad v \in V,\ w \in C(v). \tag{5.7}$$

We rewrite the first equation as

$$\alpha_{(p(v),v)} - \sum_{w \in C(v)} \alpha_{(v,w)} = \frac{\beta_v^2}{\lambda_v^2} - 1. \tag{5.8}$$

Now, if $\lambda \in \Lambda_G$ solves equations (5.7) and (5.8), then it induces a cut $C \subseteq E$ and a corresponding
partition $\mathcal{V}(C) = \{V_\ell : \ell \in \mathbb{N}_k\}$ of $V$ such that $\lambda_v = \mu_\ell$ for every $v \in V_\ell$. That is, $\lambda_v = \lambda_w$
for every $v, w \in V_\ell$, $\ell \in \mathbb{N}_k$, and $\alpha_e = 0$ for every $e \in C$. Therefore, summing equations (5.8)
for $v \in V_\ell$ we get that

$$\mu_\ell = \frac{\|\beta_{V_\ell}\|_2}{\sqrt{|V_\ell|}}.$$

Moreover, since $\mu_\ell > \mu_q$ if $V_q \Downarrow V_\ell$, we see that $\beta \in O_C$. Next, for every $\ell \in \mathbb{N}_k$ and $u \in V_\ell$ we
sum both sides of equation (5.8) for $v \in D_\ell(u)$ to obtain that

$$\alpha_{(p(u),u)} = \frac{\|\beta_{D_\ell(u)}\|_2^2}{\mu_\ell^2} - |D_\ell(u)|. \tag{5.9}$$

We see that $\beta \in I_C$ and conclude that $\beta \in U_C$.

In summary we have shown that the collection of sets $\mathcal{U}$ covers $(\mathbb{R}\setminus\{0\})^n$. Next, we show
that the elements of $\mathcal{U}$ are disjoint. To this end, we observe that the computation described
above can be reversed. That is to say, conversely, for any partition $\mathcal{V}(C) = \{V_\ell : \ell \in \mathbb{N}_k\}$ of $V$ and
$\beta \in U_C$ we conclude by Proposition 5.1 that the vectors $\delta(\beta,C)$ and $\zeta(\beta,C)$ solve the KKT
optimality conditions. Since this solution is unique, if $\beta \in U_{C_1} \cap U_{C_2}$ then it must follow that
$\delta(\beta,C_1) = \delta(\beta,C_2)$, which implies that $C_1 = C_2$. ■

Theorems 4.1 and 5.1 fall into the category of a set $\Lambda \subseteq \mathbb{R}^n$ chosen in the form of a polyhedral
cone, that is

$$\Lambda = \{\lambda : \lambda \in \mathbb{R}^n,\ A\lambda \ge 0\}$$

where $A$ is an $m \times n$ matrix. Furthermore, in the line graph of Theorem 4.1 and also the
extension in Theorem 5.1 the matrix $A$ only has elements which are $-1$, $1$ or $0$. These two
examples that we considered led to an explicit description of the norm $\|\cdot\|_\Lambda$. However, there are
seemingly simple cases of a matrix $A$ of this type where the explicit computation of the norm
$\|\cdot\|_\Lambda$ seems formidable, if not impossible. For example, if $m = 2$, $n = 4$ and

$$A = \begin{bmatrix} -1 & -1 & 1 & 0 \\ 0 & -1 & -1 & 1 \end{bmatrix}$$

we are led by KKT to a system of equations that, in the case of two active constraints, that is,
$A\lambda = 0$, are the common zeros of two fourth order polynomials in the vector $\lambda \in \mathbb{R}^2$.

6 Duality 

In this section, we comment on the utility of the class of penalty functions considered in this
paper, which is fundamentally based on their construction as a constrained infimum of quadratic
functions. To emphasize this point both theoretically and computationally, we discuss the
conversion of the regularization variational problem over $\beta \in \mathbb{R}^n$, namely

$$\mathcal{E}(\Lambda) = \inf\{E(\beta,\lambda) : \beta \in \mathbb{R}^n,\ \lambda \in \Lambda\} \tag{6.1}$$

where

$$E(\beta,\lambda) := \|y - X\beta\|_2^2 + 2\rho\,\Gamma(\beta,\lambda),$$

into a variational problem over $\lambda \in \Lambda$.

To explain what we have in mind, we introduce the following definition.

Definition 6.1. For every $\lambda \in \mathbb{R}^n_+$, we define the vector $\beta(\lambda) \in \mathbb{R}^n$ as

$$\beta(\lambda) = \operatorname{diag}(\lambda)\,M(\lambda)\,X^\top y$$

where $M(\lambda) := (\operatorname{diag}(\lambda)X^\top X + \rho I)^{-1}$.

Note that $\beta(\lambda) = \operatorname{argmin}\{E(\beta,\lambda) : \beta \in \mathbb{R}^n\}$.
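Theorem 6.1. For $\rho > 0$, $y \in \mathbb{R}^m$, any $m \times n$ matrix $X$ and any nonempty convex set $\Lambda$ we
have that

$$\mathcal{E}(\Lambda) = \min\bigl\{\rho\, y^\top (X\operatorname{diag}(\lambda)X^\top + \rho I)^{-1} y + \rho \operatorname{tr}(\operatorname{diag}(\lambda)) : \lambda \in \overline{\Lambda} \cap \mathbb{R}^n_+\bigr\}. \tag{6.2}$$

Moreover, if $\hat\lambda$ is a solution to this problem, then $\beta(\hat\lambda)$ is a solution to problem (6.1).

Proof. We substitute the formula for $\Omega(\beta|\Lambda)$ into the right hand side of equation (6.1) to obtain
that

$$\mathcal{E}(\Lambda) = \inf\{H(\lambda) : \lambda \in \Lambda\} \tag{6.3}$$

where we define

$$H(\lambda) = \min\{E(\beta,\lambda) : \beta \in \mathbb{R}^n\}.$$

A straightforward computation yields that

$$H(\lambda) = \rho\, y^\top (X\operatorname{diag}(\lambda)X^\top + \rho I)^{-1} y + \rho \operatorname{tr}(\operatorname{diag}(\lambda)).$$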
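
The computation behind this formula can be spelled out as follows. This is a sketch for $\lambda \in \mathbb{R}^n_{++}$, with the shorthand $D = \operatorname{diag}(\lambda)$ introduced here only for brevity; the passage from the third to the fourth line uses the Woodbury matrix identity.

```latex
\begin{align*}
E(\beta,\lambda) &= \|y - X\beta\|_2^2 + \rho\,\beta^\top D^{-1}\beta + \rho\,\operatorname{tr}(D), \\
\nabla_\beta E(\beta,\lambda) = 0 \;&\Longleftrightarrow\; (X^\top X + \rho D^{-1})\,\beta = X^\top y, \\
H(\lambda) &= y^\top\bigl(I - X(X^\top X + \rho D^{-1})^{-1}X^\top\bigr)\,y + \rho\,\operatorname{tr}(D) \\
&= y^\top\bigl(I + \rho^{-1} X D X^\top\bigr)^{-1} y + \rho\,\operatorname{tr}(D) \\
&= \rho\, y^\top (X D X^\top + \rho I)^{-1} y + \rho\,\operatorname{tr}(D).
\end{align*}
```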

Since $H(\lambda) \ge \rho\operatorname{tr}(\operatorname{diag}(\lambda))$, we conclude that any minimizing sequence for the optimization
problem on the right hand side of equation (6.3) must have a subsequence which converges.
These remarks confirm equation (6.2).

We now prove the second claim. For $\lambda \in \mathbb{R}^n_{++}$ a direct computation confirms that

$$\Gamma(\beta(\lambda),\lambda) = \frac{1}{2}\left(y^\top X M(\lambda)\operatorname{diag}(\lambda)M(\lambda)X^\top y + \operatorname{tr}(\operatorname{diag}(\lambda))\right).$$

Note that the right hand side of this equation provides a continuous extension of the left hand
side to $\lambda \in \mathbb{R}^n_+$. For notational simplicity, we still use the left hand side to denote this continuous
extension.

By a limiting argument, we conclude, for every $\lambda \in \overline{\Lambda}$, that

$$\Omega(\beta(\lambda)|\Lambda) \le \Gamma(\beta(\lambda),\lambda). \tag{6.4}$$

We are now ready to complete the proof of the theorem. Let $\hat\lambda$ be a solution for the optimization
problem (6.2). By definition, it holds, for any $\beta \in \mathbb{R}^n$ and $\lambda \in \Lambda$, that

$$\|y - X\beta(\hat\lambda)\|_2^2 + 2\rho\,\Gamma(\beta(\hat\lambda),\hat\lambda) = H(\hat\lambda) \le H(\lambda) \le \|y - X\beta\|_2^2 + 2\rho\,\Gamma(\beta,\lambda).$$

Combining this inequality with inequality (6.4) evaluated at $\lambda = \hat\lambda$, we conclude that

$$\|y - X\beta(\hat\lambda)\|_2^2 + 2\rho\,\Omega(\beta(\hat\lambda)|\Lambda) \le \|y - X\beta\|_2^2 + 2\rho\,\Gamma(\beta,\lambda)$$

from which the result follows. ■

An important consequence of the above theorem is a method to find a solution $\hat\beta$ to the
optimization problem (6.1) from a solution to the optimization problem (6.2). We illustrate this
idea in the case that $X = I$.
Corollary 6.1. It holds that

$$\min\bigl\{\|\beta - y\|_2^2 + 2\rho\,\Omega(\beta|\Lambda) : \beta \in \mathbb{R}^n\bigr\} = \rho \min\Bigl\{\sum_{i \in \mathbb{N}_n} \Bigl(\frac{y_i^2}{\lambda_i + \rho} + \lambda_i\Bigr) : \lambda \in \overline{\Lambda}\Bigr\}. \tag{6.5}$$

Moreover, if $\hat\lambda$ is a solution of the right optimization problem then the vector $\beta(\hat\lambda) = (\beta_i(\hat\lambda) :
i \in \mathbb{N}_n)$, whose components are defined for $i \in \mathbb{N}_n$ as

$$\beta_i(\hat\lambda) = \frac{\hat\lambda_i\, y_i}{\hat\lambda_i + \rho} \tag{6.6}$$

is a solution of the left optimization problem.

We further discuss two choices of the set $\Lambda$ in which we are able to solve problem (6.5)
analytically. The first case we consider is $\Lambda = \mathbb{R}^n_{++}$, which corresponds to the Lasso penalty. It
is an easy matter to see that $\hat\lambda = (|y| - \rho)_+$ and the corresponding regression vector is obtained
by the well-known "soft thresholding" formula $\beta(\hat\lambda) = (|y| - \rho)_+ \operatorname{sign}(y)$. The second case is
the Wedge penalty. We find that the solution of the optimization problem in the right hand side
of equation (6.5) is $\hat\lambda = (\lambda(y) - \rho)_+$, where $\lambda(y)$ is given in Theorem 4.1. Finally, we note
that Corollary 6.1 and the example following it extend to the case that $X^\top X = I$ by replacing
throughout the vector $y$ by the vector $X^\top y$. In the statistical literature this setting is referred to
as orthogonal design.
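
The Lasso case can be verified numerically in a few lines. The Python sketch below is an illustration under the assumption $X = I$ and $\Lambda = \mathbb{R}^n_{++}$; the function names and the sample data are hypothetical. It computes $\hat\lambda = (|y| - \rho)_+$, maps it to $\beta(\hat\lambda)$ through formula (6.6), and compares the result with a direct implementation of soft thresholding.

```python
def lasso_lambda(y, rho):
    """Componentwise minimizer of y_i^2 / (lam_i + rho) + lam_i over lam_i >= 0."""
    return [max(abs(yi) - rho, 0.0) for yi in y]


def beta_from_lambda(y, lam, rho):
    """Formula (6.6): beta_i = lam_i * y_i / (lam_i + rho)."""
    return [li * yi / (li + rho) for li, yi in zip(lam, y)]


def soft_threshold(y, rho):
    """Classical soft thresholding, used here only as a cross-check."""
    signs = [1 if yi > 0 else -1 if yi < 0 else 0 for yi in y]
    return [max(abs(yi) - rho, 0.0) * s for yi, s in zip(y, signs)]


y, rho = [3.0, -0.5, -2.0, 1.0], 1.0  # hypothetical data
lam = lasso_lambda(y, rho)
beta = beta_from_lambda(y, lam, rho)
```

Components with $|y_i| \le \rho$ receive $\hat\lambda_i = 0$ and are therefore set exactly to zero by (6.6).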



7 Optimization method 

In this section, we address the issue of implementing the learning method (2.2) numerically.

Since the penalty function $\Omega(\cdot|\Lambda)$ is constructed as the infimum of a family of quadratic
regularizers, the optimization problem (2.2) reduces to a simultaneous minimization over the
vectors $\beta$ and $\lambda$. For a fixed $\lambda \in \Lambda$, the minimum over $\beta \in \mathbb{R}^n$ is a standard Tikhonov
regularization and can be solved directly in terms of a matrix inversion. For a fixed $\beta$, the
minimization over $\lambda \in \Lambda$ requires computing the penalty function (2.3). These observations
naturally suggest an alternating minimization algorithm, which has already been considered
in special cases in [1]. To describe our algorithm we choose $\epsilon > 0$ and introduce the mapping
$\phi^\epsilon : \mathbb{R}^n \to \mathbb{R}^n_{++}$ whose $i$-th coordinate at $\beta \in \mathbb{R}^n$ is given by

$$\phi^\epsilon_i(\beta) = \sqrt{\beta_i^2 + \epsilon}.$$

For $\beta \in (\mathbb{R}\setminus\{0\})^n$, we also let $\lambda(\beta) = \operatorname{argmin}\{\Gamma(\beta,\lambda) : \lambda \in \Lambda\}$.

The alternating minimization algorithm is defined as follows: choose $\lambda^0 \in \Lambda$ and, for $k \in \mathbb{N}$,
define the iterates

$$\beta^k = \beta(\lambda^{k-1}) \tag{7.1}$$
$$\lambda^k = \lambda(\phi^\epsilon(\beta^k)). \tag{7.2}$$
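
As an illustration, the two updates can be written out in the simplest setting. The Python sketch below makes several assumptions that are not in the text: it takes $X = I$ and $\Lambda = \mathbb{R}^n_{++}$ (the Lasso case), it takes the smoothing map to be $\phi^\epsilon_i(\beta) = \sqrt{\beta_i^2 + \epsilon}$, and the function name, starting point and iteration count are arbitrary. In this setting the $\beta$-update is componentwise and the $\lambda$-update is available in closed form.

```python
import math


def alternating_minimization(y, rho, eps=1e-12, iterations=200):
    """Iterate (7.1)-(7.2) for X = I and Lambda = R^n_{++} (assumptions of this sketch)."""
    lam = [1.0] * len(y)  # arbitrary starting point lambda^0 in Lambda
    beta = [0.0] * len(y)
    for _ in range(iterations):
        # (7.1): with X = I the Tikhonov step decouples, beta_i = lam_i * y_i / (lam_i + rho).
        beta = [li * yi / (li + rho) for li, yi in zip(lam, y)]
        # (7.2): over R^n_{++}, Gamma(b, .) is minimized at lam_i = |b_i|; applied to the
        # smoothed vector phi^eps(beta) this gives lam_i = sqrt(beta_i^2 + eps).
        lam = [math.sqrt(bi * bi + eps) for bi in beta]
    return beta


beta = alternating_minimization([3.0, -0.5, -2.0], rho=1.0)  # hypothetical data
```

For small $\epsilon$ the iterates approach the soft thresholding solution of the orthogonal design case, here approximately $(2, 0, -1)$. For a general design matrix the $\beta$-update requires solving a linear system instead.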

The following theorem establishes convergence of this algorithm. 
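Theorem 7.1. If the set $\Lambda$ is admissible in the sense of Definition 2.1, then the iterations (7.1)-(7.2)
converge to a vector $\gamma(\epsilon)$ such that

$$\gamma(\epsilon) = \operatorname{argmin}\bigl\{\|y - X\beta\|_2^2 + 2\rho\,\Omega(\phi^\epsilon(\beta)|\Lambda) : \beta \in \mathbb{R}^n\bigr\}.$$

Moreover, any convergent subsequence of the sequence $\{\gamma(1/\ell) : \ell \in \mathbb{N}\}$ converges to a solution
of the optimization problem (2.2).

Proof. We divide the proof into several steps. To this end, we define

$$E_\epsilon(\beta,\lambda) := \|y - X\beta\|_2^2 + 2\rho\,\Gamma(\phi^\epsilon(\beta),\lambda)$$

and note that $\beta(\lambda) = \operatorname{argmin}\{E_\epsilon(\alpha,\lambda) : \alpha \in \mathbb{R}^n\}$.

Step 1. We define two sequences, $\theta_k = E_\epsilon(\beta^k,\lambda^{k-1})$ and $\nu_k = E_\epsilon(\beta^k,\lambda^k)$ and observe, for
any $k \ge 2$, that

$$\nu_k \le \theta_k \le \nu_{k-1}. \tag{7.3}$$

These inequalities follow directly from the definition of the alternating algorithm, see equations
(7.1) and (7.2).

Step 2. We define the compact set $B = \{\beta : \beta \in \mathbb{R}^n,\ \|\beta\|_1 \le \theta_1\}$. From the first inequality
in Proposition 2.2 and inequality (7.3) we conclude, for every $k \in \mathbb{N}$, that $\beta^k \in B$.

Step 3. We define the function $g : \mathbb{R}^n \to \mathbb{R}$ at $\beta \in \mathbb{R}^n$ as

$$g(\beta) = \min\{E_\epsilon(\alpha,\lambda(\phi^\epsilon(\beta))) : \alpha \in \mathbb{R}^n\}.$$

We claim that $g$ is continuous on $B$. In fact, there exists a constant $\kappa > 0$ such that, for every
$\gamma^1, \gamma^2 \in B$, it holds that

$$|g(\gamma^1) - g(\gamma^2)| \le \kappa\,\|\lambda(\phi^\epsilon(\gamma^1)) - \lambda(\phi^\epsilon(\gamma^2))\|_\infty. \tag{7.4}$$

The essential ingredient in the proof of this inequality is the fact that there exist constants $a$ and
$b$ such that, for all $\beta \in B$, $\lambda(\phi^\epsilon(\beta)) \in [a,b]^n$. This follows from the inequalities developed in
the proof of Proposition 2.1.

Step 4. By step 2, there exists a subsequence $\{\beta^{k_\ell} : \ell \in \mathbb{N}\}$ which converges to $\tilde\beta \in B$ and,
for all $\beta \in \mathbb{R}^n$ and $\lambda \in \Lambda$, it holds that

$$E_\epsilon(\tilde\beta,\lambda(\phi^\epsilon(\tilde\beta))) \le E_\epsilon(\beta,\lambda(\phi^\epsilon(\tilde\beta))), \qquad E_\epsilon(\tilde\beta,\lambda(\phi^\epsilon(\tilde\beta))) \le E_\epsilon(\tilde\beta,\lambda). \tag{7.5}$$

Indeed, from step 1 we conclude that there exists $\psi \in \mathbb{R}_{++}$ such that

$$\lim_{k \to \infty} \theta_k = \lim_{k \to \infty} \nu_k = \psi.$$

Since, by Proposition 2.1, $\lambda(\beta)$ is continuous for $\beta \in (\mathbb{R}\setminus\{0\})^n$, we obtain that

$$\lim_{\ell \to \infty} \lambda^{k_\ell} = \lambda(\phi^\epsilon(\tilde\beta)).$$

By the definition of the alternating algorithm, we have, for all $\beta \in \mathbb{R}^n$ and $\lambda \in \Lambda$, that

$$\theta_{k+1} = E_\epsilon(\beta^{k+1},\lambda^k) \le E_\epsilon(\beta,\lambda^k), \qquad \nu_k = E_\epsilon(\beta^k,\lambda^k) \le E_\epsilon(\beta^k,\lambda).$$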
Input: $\beta \in \mathbb{R}^n$; Output: $J_1, \dots, J_k$
Initialization: $k \leftarrow 0$
for $t = 1$ to $n$ do
    $J_{k+1} \leftarrow \{t\}$
    $k \leftarrow k + 1$
    while $k > 1$ and $\|\beta_{J_{k-1}}\|_2/\sqrt{|J_{k-1}|} \le \|\beta_{J_k}\|_2/\sqrt{|J_k|}$
        $J_{k-1} \leftarrow J_{k-1} \cup J_k$
        $k \leftarrow k - 1$
    end
end

Figure 4: Iterative algorithm to compute the wedge penalty

From this inequality we obtain, passing to the limit, inequalities (7.5).

Step 5. The vector $(\tilde\beta,\lambda(\phi^\epsilon(\tilde\beta)))$ is a stationary point. Indeed, since $\Lambda$ is admissible, by step
3, $\lambda(\phi^\epsilon(\tilde\beta)) \in \operatorname{int}(\Lambda)$. Therefore, since $E_\epsilon$ is continuously differentiable this claim follows from
step 4.

Step 6. The alternating algorithm converges. This claim follows from the fact that $E_\epsilon$
is strictly convex. Hence, $E_\epsilon$ has a unique global minimum in $\mathbb{R}^n \times \Lambda$, which in virtue of
inequalities (7.3) is attained at $(\tilde\beta,\lambda(\phi^\epsilon(\tilde\beta)))$.

The last claim in the theorem follows from the fact that the set $\{\gamma(\epsilon) : \epsilon > 0\}$ is bounded
and the function $\lambda(\beta)$ is continuous. ■

The most challenging step in the alternating algorithm is the computation of the vector $\lambda^k$.
Fortunately, if $\Lambda$ is a second order cone, problem (2.3) defining the penalty function $\Omega(\cdot|\Lambda)$ may
be reformulated as a second order cone program (SOCP), see e.g. (g]. To see this, we introduce
an additional variable $t \in \mathbb{R}^n$ and note that

$$\Omega(\beta|\Lambda) = \min\Bigl\{\frac{1}{2}\sum_{i \in \mathbb{N}_n} (t_i + \lambda_i) : \|(2\beta_i,\ t_i - \lambda_i)\|_2 \le t_i + \lambda_i,\ t_i \ge 0,\ i \in \mathbb{N}_n,\ \lambda \in \Lambda\Bigr\}.$$
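
To see why the cone constraint encodes the inequality $\beta_i^2/\lambda_i \le t_i$, note that both sides of the constraint are non-negative when $t_i \ge 0$ and $\lambda_i > 0$, so it can be squared. The short derivation below is added for illustration:

```latex
\begin{align*}
\|(2\beta_i,\ t_i - \lambda_i)\|_2 \le t_i + \lambda_i
&\iff 4\beta_i^2 + (t_i - \lambda_i)^2 \le (t_i + \lambda_i)^2 \\
&\iff 4\beta_i^2 \le 4\,t_i\lambda_i \\
&\iff \frac{\beta_i^2}{\lambda_i} \le t_i .
\end{align*}
```

At a minimizer each $t_i$ equals $\beta_i^2/\lambda_i$, so the sum of the terms $t_i + \lambda_i$ reduces to the sum of $\beta_i^2/\lambda_i + \lambda_i$ that defines the penalty.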

In particular, in the examples discussed in Sections 4 and 5 the set $\Lambda$ is formed by linear
constraints and, so, problem (2.3) is an SOCP. We may then use available toolboxes to compute
the solution of this problem. However, in special cases the computation of the penalty function
may be significantly facilitated by using available analytical formulas. Here, for simplicity we
describe how to do this in the case of the wedge penalty. For this purpose we say that a vector
$\beta \in \mathbb{R}^n$ is admissible if, for every $k \in \mathbb{N}_n$, it holds that $\|\beta_{\mathbb{N}_k}\|_2/\sqrt{k} \le \|\beta\|_2/\sqrt{n}$.

The proof of the next lemma is straightforward and we do not elaborate on the details.

Lemma 7.1. If $\beta \in \mathbb{R}^n$ and $\delta \in \mathbb{R}^p$ are admissible and $\|\beta\|_2/\sqrt{n} \le \|\delta\|_2/\sqrt{p}$ then $(\beta,\delta)$ is
admissible.

The iterative algorithm presented in Figure 4 can be used to find the partition $\mathcal{J} = \{J_\ell :
\ell \in \mathbb{N}_k\}$ and, so, the vector $\lambda(\beta)$ described in Theorem 4.1. The algorithm processes the
components of the vector $\beta$ in a sequential manner. Initially, the first component forms the only
set in the partition. After the generic iteration $t - 1$, where the partition is composed of $k$ sets,
the index of the next component, $t$, is put in a new set $J_{k+1}$. Two cases can occur: the means
of the squares of the sets are in strict descending order, or this order is violated by the last set.
The latter is the only case that requires further action, so the algorithm merges the last two sets
and repeats until the sets in the partition are fully ordered. Note that, since the only operation
performed by the algorithm is the merge of admissible sets, Lemma 7.1 ensures that after each
step $t$ the current partition satisfies the "stay within" conditions $\|\beta_{J_\ell}\|_2/\sqrt{|J_\ell|} \ge \|\beta_K\|_2/\sqrt{|K|}$ for every
$\ell \in \mathbb{N}_k$ and every subset $K \subseteq J_\ell$ formed by the first $k < |J_\ell|$ elements of $J_\ell$. Moreover, the
while loop ensures that after each step the current partition satisfies, for every $\ell \in \mathbb{N}_{k-1}$, the
"cross over" conditions $\|\beta_{J_\ell}\|_2/\sqrt{|J_\ell|} > \|\beta_{J_{\ell+1}}\|_2/\sqrt{|J_{\ell+1}|}$. Thus, the output of the algorithm
is the partition $\mathcal{J}$ defined in Theorem 4.1. In the actual implementation of the algorithm, the
means of squares of each set can be saved. This allows us to compute the mean of squares of
a merged set as a weighted mean, which is a constant time operation. Since there are $n - 1$
consecutive terms in total, this is also the maximum number of merges that the algorithm can
perform. Each merge requires exactly one additional test, so we can conclude that the running
time of the algorithm is linear.
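
The procedure described above can be sketched in Python as follows. This is an illustrative implementation, not the authors' MATLAB code: the function names and the 0-based indexing are choices made here, each block stores its size and its sum of squares so that a merge takes constant time, and the penalty value is computed with the line-graph form of formula (5.5), which is an assumption of this sketch.

```python
import math


def wedge_blocks(beta):
    """Partition the indices into consecutive blocks; returns (start, size, sum_of_squares)."""
    blocks = []
    for t, b in enumerate(beta):
        blocks.append([t, 1, b * b])  # the new index starts its own block
        # Merge while the mean of squares of the previous block does not exceed that of
        # the last block; cross-multiplication avoids divisions.
        while len(blocks) > 1 and blocks[-2][2] * blocks[-1][1] <= blocks[-1][2] * blocks[-2][1]:
            _, size, sum_sq = blocks.pop()
            blocks[-1][1] += size
            blocks[-1][2] += sum_sq
    return [tuple(block) for block in blocks]


def wedge_lambda(beta):
    """lambda_j = ||beta_J||_2 / sqrt(|J|) for every index j in the block J."""
    lam = []
    for _, size, sum_sq in wedge_blocks(beta):
        lam.extend([math.sqrt(sum_sq / size)] * size)
    return lam


def wedge_penalty(beta):
    """Sum over the blocks of sqrt(|J|) * ||beta_J||_2."""
    return sum(math.sqrt(size * sum_sq) for _, size, sum_sq in wedge_blocks(beta))
```

For example, with the hypothetical input `[3.0, 1.0, 2.0]` the last two components violate the ordering and are merged, giving the blocks $\{1\}$ and $\{2,3\}$ in the 1-based notation of the text. A vector that is already decreasing in absolute value keeps singleton blocks, and the penalty reduces to the $\ell_1$ norm.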



8 Numerical simulations 

In this section we present some numerical simulations with the proposed method. For
simplicity, we consider data generated noiselessly from $y = X\beta^*$, where $\beta^* \in \mathbb{R}^{100}$ is the true
underlying regression vector, and $X$ is an $m \times 100$ input matrix, $m$ being the sample size. The
elements of $X$ are generated i.i.d. from the standard normal distribution, and the columns of $X$
are then normalized such that their $\ell_2$ norm is 1. Since we consider the noiseless case, we solve
the interpolation problem $\min\{\Omega(\beta) : y = X\beta\}$, for different choices of the penalty function
$\Omega$. In practice, (2.2) is solved for a tiny value of the parameter, for example, $\rho = 10^{-8}$, which
we found to be sufficient to ensure that the error term in (2.2) is negligible at the minimum. All
experiments were repeated 50 times, generating each time a new matrix $X$. In the figures we
report the average of the model error of the vector learned by each method, as a function of the
sample size $m$. The former is defined as $\mathrm{ME}(\hat\beta) = \mathbb{E}[\|\hat\beta - \beta^*\|_2^2]/\mathbb{E}[\|\beta^*\|_2^2]$. In the following, we
discuss a series of experiments, corresponding to different choices for the model vector $\beta^*$ and
its sparsity pattern. In all experiments, we solved the optimization problem (2.2) with the
algorithm presented in Section 7. Whenever possible we solved step (7.2) using analytical formulas
and resorted to the solver CVX (http://cvxr.com/cvx/) in the other cases. For example, in the
case of the wedge penalty, we found that the computational time of the algorithm in Figure 4 is
495, 603, 665, 869, 1175 times faster than that of the solver CVX for $n = 100, 500, 1000, 2500, 5000$,
respectively. Our implementation ran on a 16GB memory dual core Intel machine. The MATLAB
code is available at http://www.cs.ucl.ac.uk/staff/M.Pontil/software.html.
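
The data generation protocol can be sketched as follows. This Python fragment is illustrative and is not the authors' MATLAB code: the function names, the seed and the sample size are assumptions, and the helper `model_error` computes the error ratio for a single run, whereas the figures report averages over 50 runs. The model vector used in the example is the wedge model described later in this section.

```python
import math
import random


def random_design(m, n, rng):
    """m x n matrix with i.i.d. standard normal entries and columns scaled to unit l2 norm."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
    for j in range(n):
        norm = math.sqrt(sum(X[i][j] ** 2 for i in range(m)))
        for i in range(m):
            X[i][j] /= norm
    return X


def model_error(beta_hat, beta_star):
    """Squared estimation error of one run, relative to the squared norm of the true vector."""
    num = sum((a - b) ** 2 for a, b in zip(beta_hat, beta_star))
    den = sum(b ** 2 for b in beta_star)
    return num / den


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
m, n = 20, 100
X = random_design(m, n, rng)
beta_star = [float(11 - j) if j <= 10 else 0.0 for j in range(1, n + 1)]
y = [sum(X[i][j] * beta_star[j] for j in range(n)) for i in range(m)]  # noiseless measurements
```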

Box. In the first experiment the model is 10-sparse, where each nonzero component, in a random
position, is an integer uniformly sampled in the interval $[-10, 10]$. We wish to show that the
more accurate the prior information about the model is, the more precise the estimate will be.
We use a box penalty (see Theorem 3.1) constructed "around" the model, imagining that an
oracle tells us that each component $|\beta^*_i|$ is bounded within an interval. We consider three boxes
$B[a,b]$ of different sizes, namely $a_i = (r - |\beta^*_i|)_+$ and $b_i = (|\beta^*_i| - r)_+$ and radii $r = 5$, $1$ and
$0.001$, which we denote as Box-A, Box-B and Box-C, respectively. We compare these methods
with the Lasso, see Figure 5-a. As expected, the three box penalties perform better. Moreover,
as the radius of a box diminishes, the amount of information about the true model increases,
and the performance improves.

Figure 5: Comparison between different penalty methods: (a) Box vs. Lasso; (b,c) Wedge vs.
Hierarchical group Lasso; (d) Composite wedge. See text for more information.

Wedge. In the second experiment, we consider a regression vector whose components are
nonincreasing in absolute value and only a few are nonzero. Specifically, we choose a
10-sparse vector: $\beta^*_j = 11 - j$, if $j \in \mathbb{N}_{10}$ and zero otherwise. We compare the Lasso, which makes
no use of such ordering information, with the wedge penalty $\Omega(\beta|W)$ (see Theorem 4.1) and
the hierarchical group Lasso in [32], which both make use of such information. For the group
Lasso we choose $\Omega(\beta) = \sum_{\ell \in \mathbb{N}_{100}} \|\beta_{J_\ell}\|_2$ with $J_\ell = \{\ell, \ell+1, \dots, 100\}$, $\ell \in \mathbb{N}_{100}$. These two
methods are referred to as "Wedge" and "GL-lin" in Figure 5-b, respectively. As expected both
methods improve over the Lasso, with "GL-lin" being the best of the two. We further tested
the robustness of the methods, by adding two additional nonzero components with value of 10
to the vector $\beta^*$ in a random position between 20 and 100. This result, reported in Figure 5-c,
indicates that "GL-lin" is more sensitive to such a perturbation.

Composite wedge. Next we consider a more complex experiment, where the regression vector
is sparse within different contiguous regions $P_1, \dots, P_{10}$, and the $\ell_1$ norm on one region is larger
than the $\ell_1$ norm on the next region. We choose sets $P_i = \{10(i-1)+1, \dots, 10i\}$, $i \in \mathbb{N}_{10}$,
and generate a 6-sparse vector $\beta^*$ whose $i$-th nonzero element has value $31 - i$ (decreasing)
and is in a random position in $P_i$, for $i \in \mathbb{N}_6$. We encode this prior knowledge by choosing
$\Omega(\beta|\Lambda)$ with $\Lambda = \{\lambda \in \mathbb{R}^{100} : \|\lambda_{P_i}\|_1 \ge \|\lambda_{P_{i+1}}\|_1,\ i \in \mathbb{N}_9\}$. This method constrains the sum
of the sets to be nonincreasing and may be interpreted as the composition of the wedge set with
an average operation across the sets $P_i$, which may be computed using Proposition 2.3. This
method, which is referred to as "C-Wedge" in Figure 5-d, is compared to the Lasso and to three
other versions of the group Lasso. The first is a standard group Lasso with the nonoverlapping
groups $J_i = P_i$, $i \in \mathbb{N}_{10}$, thus encouraging the presence of sets of zero elements, which is
useful because there are 4 such sets. The second is a variation of the hierarchical group Lasso
discussed above with $J_i = \bigcup_{j=i}^{10} P_j$, $i \in \mathbb{N}_{10}$. A problem with these approaches is that the $\ell_2$
norm is applied at the level of the individual sets $P_i$, which does not promote sparsity within
these sets. To counter this effect we can enforce contiguous nonzero patterns within each of the
$P_i$, as proposed by [14]. That is, we consider as the groups the sets formed by all sequences
of $q \in \mathbb{N}_9$ consecutive elements at the beginning or at the end of each of the sets $P_i$, for a
total of 180 groups. These three groupings will be referred to as "GL-ind", "GL-hie" and
"GL-con" in Figure 5-d, respectively. This result indicates the advantage of "C-Wedge" over the
other methods considered. In particular, the group Lasso methods fall behind our method and
the Lasso, with "GL-con" being slightly better than "GL-ind" and "GL-hie". Notice also that
all group Lasso methods gradually diminish the model error until they have a point for each
dimension, while our method and the Lasso have a steeper descent, reaching zero at a number
of points which is less than half the number of dimensions.

Polynomials. The constraints on the finite differences (see equation (4.10)) impose a structure
on the sparsity of the model. To further investigate this possibility we now consider some
models whose absolute value belongs to the sets of constraints $W^k$, where $k = 1, \dots, 4$. Specifically,
we evaluate the polynomials $p_1(t) = -(t+5)$, $p_2(t) = (t+6)(t-2)$, $p_3(t) = -(t+6.5)\,t\,(t-1.5)$
and $p_4(t) = (t+6.5)(t-2.5)(t+1)\,t$ at 100 equally spaced (0.1) points starting from $-7$. We
take the positive part of each component and scale it to 10, so that the results can be seen in
Figure 7. The roots of the polynomials have been chosen so that the sparsity of the models is
either 18 or 19.

We solve the interpolation problem using our method with the penalty $\Omega(\beta|W^k)$, $k =
1, \dots, 4$, with the objective of testing the robustness of our method: the constraint set $W^k$
should be a more meaningful choice when $|\beta^*|$ is in it, but the exact knowledge of the degree
is not necessary. This is indeed the case: "W-k" outperforms the Lasso for every $k$, but among
these methods the best one "knows" the degree of $|\beta^*|$. For clarity, in Figure 6 we included
only the best method.

One important feature of these sparsity patterns is the number of contiguous regions: 1, 2,
2 and 3 respectively. This prior information cannot be exploited with convex optimization
techniques, so we tested our method against StructOMP, proposed by [11], a state of the art greedy
algorithm. It relies on a complexity parameter which depends on the number of contiguous
regions of the model, and which we provide exactly to the algorithm. The performance of "W-k"
is comparable to or better than that of StructOMP.

Figure 6: Comparison between StructOMP and penalty $\Omega(\beta|W^k)$, $k = 1, \dots, 4$, used for several
polynomial models: (a) degree 1, (b) degree 2, (c) degree 3, (d) degree 4.

As a way of testing the methods in a less artificial setting, we repeat the experiment using
the same sparsity patterns, but replacing each nonzero component with a uniformly sampled
random number between 1 and 2. In Figure 8 we can see that, even if now the models manifestly
do not belong to $W^k$, we still have an advantage because the constraints look for a limited
number of contiguous regions. We found that in this case StructOMP has difficulties, probably
due to the randomness of the model.

Finally, Figure 9 displays the regression vector found by the Lasso and the vector learned
by "W-2" (left) and by the Lasso and "W-3" (right), in a single run with sample size of 15 and
35, respectively. The estimated vectors (green) are superimposed on the true vector (black). Our
method provides a better estimate than the Lasso in both cases. We found that the estimates of
StructOMP are too variable for it to be meaningful to include one of them here.
Figure 7: Silhouette of the polynomials by degree: (a) $k = 1$, (b) $k = 2$, (c) $k = 3$, (d) $k = 4$.



9 Conclusion 

We proposed a family of penalty functions that can be used to model structured sparsity in 
linear regression. We provided theoretical, algorithmic and computational information about 
this new class of penalty functions. Our theoretical observations highlight the generality of this 
framework to model structured sparsity. An important feature of our approach is that it can 
deal with richer model structures than current approaches while maintaining convexity of the 
penalty function. Our practical experience indicates that these penalties perform well
numerically, improving over state of the art penalty methods for structured sparsity, suggesting that our
framework is promising for applications.

The methods developed here can be extended in different directions. We mention here 
several possibilities. For example, for any r > 0, it readily follows that 



inf 



X e 




(9.1) 



where $p = 2r/(r+1)$ and $\|\beta\|_p$ is the usual $\ell_p$-norm on $\mathbb{R}^n$. This formula leads us to consider
the same optimization problem over a constraint set $\Lambda$. Note that if $p \to 0$ the left hand side of
the above equation converges to the cardinality of the support of the vector $\beta$.

Problems associated with multi-task learning yj, |2J] demand matrix analogs of the results
discussed here. In this regard, we propose the following family of unitarily invariant norms on
$d \times n$ matrices. Let $k = \min(d,n)$ and $\sigma(B) \in \mathbb{R}^k_+$ be the vector formed from the singular
values of $B$. When $\Lambda$ is a nonempty convex set which is invariant under permutations, our point
of view in this paper suggests the penalty

$$\|B\|_\Lambda = \Omega(\sigma(B)\,|\,\Lambda).$$

The fact that this is a norm follows from the von Neumann characterization of unitarily invariant
norms. When $\Lambda = \mathbb{R}^k_{++}$ this norm reduces to the trace norm |2D.

Figure 8: Comparison between StructOMP and penalty $\Omega(\beta|W^k)$, $k = 1, \dots, 4$, used for several
polynomial models with random values between the roots: (a) degree 1, (b) degree 2, (c) degree
3, (d) degree 4.

Finally, the ideas discussed in this paper can be used in the context of kernel learning, see 



finally. 



20, |25D and references therein. Let $K_\ell$, $\ell \in \mathbb{N}_n$, be prescribed reproducing kernels
on a set $\mathcal{X}$, and $H_\ell$ the corresponding reproducing kernel Hilbert spaces with norms $\|\cdot\|_\ell$. We
consider the problem



mm 



i6N m \ teN„ 




and note that the choice $\Lambda = \mathbb{R}^n_{++}$ corresponds to multiple kernel learning.

All the above examples deserve a detailed analysis and we hope to provide such in future 
work. 
Figure 9: Lasso vs. penalty $\Omega(\cdot|\Lambda)$ for Convex (left) and Cubic (right); see text for more
information.

A Appendix

In this appendix we describe in detail a result due to J.M. Danskin, which we use in the proof
of Proposition 2.1.

Definition A.1. Let $f$ be a real-valued function defined on an open subset $X$ of $\mathbb{R}^n$ and $u \in
\mathbb{R}^n$. The directional derivative of $f$ at $x \in X$ in the "direction" $u$ is denoted by $(D_u f)(x)$ and
is defined as

$$(D_u f)(x) := \lim_{t \to 0} \frac{f(x + tu) - f(x)}{t}$$

if the limit exists. When the limit is taken through nonnegative values of $t$, we denote the
corresponding right directional derivative by $D_u^+$.

Let $Y$ be a compact metric space, $F : X \times Y \to \mathbb{R}$ a continuous function on its domain and
define the function $f : X \to \mathbb{R}$ at $x \in X$ as

$$f(x) = \min\{F(x,y) : y \in Y\}.$$

We say that $F$ is a Danskin function if, for every $u \in \mathbb{R}^n$, the function $F'_u : X \times Y \to \mathbb{R}$ defined
at $(x,y) \in X \times Y$ as $F'_u(x,y) = (D_u F(\cdot,y))(x)$ is continuous on $X \times Y$. Our notation is meant
to convey the fact that the directional derivative is taken relative to the first variable of $F$.
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Theorem A.l. If X is an open subset of W 1 , Y a is compact metric space, F : X x Y is a 
Danskin function, ueK" and x G X, then 

(D+f)(x) = min{F^x,y):yeY x } 

where Y x :={y:y€Y, F{x,y) = f(x)}. 

Proof. If x G X, y G Y x and hgR" then, for all positive t, sufficiently small, we have that 

f{x + tu) - f(x) F(x + tu,y) -F(x,y) 
t t 

Letting t — > + , we get that 

l imsup f(x + tu)-f(x) ^ ^ {K ^ y) ye (A l) 

t^o+ t 

Next, we choose a sequence {tk : k E N} of positive numbers such that lim^oo t/- — and 
Hm f(x + t k u)-f(x) = . nf f{x + tu)-f{x) ^ 

k-^oo tk t-5>0+ t 

From the definition of the function /, there exists a yt G Y such that f(x + t k u) = F(x + 
tkUiVk)- Since Y is a compact metric space, there is a subsequence {y ke : i G N} which 
converges to some y^ G Y. It readily follows from our hypothesis that the function / is 
continuous on X. Indeed, we have, for every x±, X2 G X, that 

|/(rci)-/(z2)| <m&x{\F(x u y)- F(x 2 ,y)\ ■ y G Y} . 

Hence we conclude that y^ G Y x . Moreover, we have that 

f(x + t k u) -f(x) Fjx + t k u, y k ) - F(x, y k ) 
tk t k 

By the mean value theorem, we conclude that there is positive number Ok < tk such that the 

f(x + t k u) -f(x) 

7 > F u (x + a k u, y k ). 

We let $\ell \to \infty$ and use the hypothesis that $F$ is a Danskin function to conclude that 

$$\liminf_{t \to 0^+} \frac{f(x + tu) - f(x)}{t} \ge F'_u(x, y_\infty) \ge \min\{F'_u(x, y) : y \in Y_x\}.$$

Combining this inequality with (A.1) proves the result. ■ 

We note that [4, p. 737] describes a result which is attributed to Danskin without reference. 
That result differs from the result presented above. The result in [4, p. 737] requires the 
hypothesis of convexity on the function $F$. The theorem above and its proof are an adaptation of 
Theorem 1 in [8]. 
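
To make the statement of Theorem A.1 concrete, the following sketch checks it numerically on a toy instance that is not taken from the paper: we assume $F(x, y) = (x - y)^2$ with the finite (hence compact) set $Y = \{-1, +1\}$, so that $f(x) = \min\{(x+1)^2, (x-1)^2\}$ has a kink at $x = 0$. All function names are hypothetical and chosen for illustration only.

```python
# Toy check of Theorem A.1 (assumed example, not from the paper):
# F(x, y) = (x - y)^2 on X = R, with Y = {-1, +1}.

Y = (-1.0, 1.0)


def F(x, y):
    return (x - y) ** 2


def dF_dx(x, y):
    # Derivative of F with respect to its first argument. It is continuous
    # in (x, y), so F is a Danskin function in the sense defined above.
    return 2.0 * (x - y)


def f(x):
    # f(x) = min{F(x, y) : y in Y}
    return min(F(x, y) for y in Y)


def danskin_right_derivative(x, u, tol=1e-12):
    # Right-hand side of Theorem A.1: minimize F'_u(x, y) over the set Y_x
    # of minimizers of F(x, .).
    fx = f(x)
    active = [y for y in Y if abs(F(x, y) - fx) <= tol]
    return min(dF_dx(x, y) * u for y in active)


def finite_difference(x, u, t=1e-6):
    # One-sided difference quotient approximating (D_u^+ f)(x).
    return (f(x + t * u) - f(x)) / t


if __name__ == "__main__":
    for x, u in [(0.0, 1.0), (0.0, -1.0), (0.5, 1.0)]:
        print(x, u, danskin_right_derivative(x, u), finite_difference(x, u))
```

At $x = 0$ both elements of $Y$ are minimizers, and the formula selects the smaller of the two directional derivatives, which is why the right derivative is negative in both directions $u = 1$ and $u = -1$.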
We are now ready to present the proof of Proposition 2.1. 

Proof of Proposition 2.1. The essential part of the proof is an application of Theorem A.1. To 
apply this result, we start with a $\beta \in (\mathbb{R} \setminus \{0\})^n$ and introduce a neighborhood of this vector 
defined as 

$$X(\beta) = \left\{\alpha : \alpha \in \mathbb{R}^n,\ \|\alpha - \beta\|_\infty < \frac{\beta_{\min}}{2}\right\}$$

where $\beta_{\min} = \min\{|\beta_i| : i \in \mathbb{N}_n\}$. Theorem A.1 also requires us to specify a compact subset 
$Y(\beta)$ of $\mathbb{R}^n$. We construct this set in the following way. We choose a fixed $\lambda \in \Lambda$ and a positive 
$\epsilon$. From these constants we define the constants 

cGS) = WOT + +X 



$$a(\beta) = \frac{\beta_{\min}^2}{4(c(\beta) + \epsilon)}, \qquad b(\beta) = \max(a(\beta),\, c(\beta) + \epsilon).$$

With these definitions, we choose our compact set $Y(\beta)$ to be $Y(\beta) = \Lambda_{a(\beta), b(\beta)}$. To apply 
Theorem A.1 we use the fact, for any $\alpha \in X(\beta)$, that 

$$\Omega(\alpha|\Lambda) = \min\{\Gamma(\alpha, \lambda) : \lambda \in Y(\beta)\}. \qquad (A.2)$$

Let us, for the moment, assume the validity of this equation and proceed with the remaining 
details of the proof. As a consequence of this equation, we conclude that there exists a vector 
$\lambda(\beta)$ such that $\Omega(\beta|\Lambda) = \Gamma(\beta, \lambda(\beta))$. Moreover, when $\beta \in (\mathbb{R} \setminus \{0\})^n$ the function 
$\lambda \mapsto \Gamma(\beta, \lambda)$, defined for $\lambda \in \mathbb{R}^n_{++}$, is strictly convex on its domain and so $\lambda(\beta)$ is 
unique. 

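By construction, we know, for every $\alpha \in X(\beta)$, that 

$$\max\left\{\left|\lambda_i(\alpha) - \frac{a(\beta) + b(\beta)}{2}\right| : i \in \mathbb{N}_n\right\} \le \frac{b(\beta) - a(\beta)}{2}.$$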

From this inequality we shall establish that $\lambda(\beta)$ depends continuously on $\beta$. To this end, we 
choose any sequence $\{\beta^k : k \in \mathbb{N}\}$ which converges to $\beta$ and from the above inequality we 
conclude that the sequence of vectors $\lambda(\beta^k)$ is bounded. However, this sequence can only have 
one cluster point, namely $\lambda(\beta)$, because $\Gamma$ is continuous. Specifically, if $\lim_{k \to \infty} \lambda(\beta^k) = \tilde\lambda$, 
then, for every $\lambda \in \Lambda$, it holds that $\Gamma(\beta^k, \lambda(\beta^k)) \le \Gamma(\beta^k, \lambda)$ and, passing to the limit, $\Gamma(\beta, \tilde\lambda) \le 
\Gamma(\beta, \lambda)$, implying that $\tilde\lambda = \lambda(\beta)$. 

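Likewise, equation (A.2) yields the formula for the partial derivatives of $\Omega(\cdot|\Lambda)$. Specifically, 
we identify $F$ and $f$ in Theorem A.1 with $\Gamma$ and $\Omega(\cdot|\Lambda)$, respectively, and note that 

$$\frac{\partial \Omega}{\partial \beta_i}(\beta|\Lambda) = \min\left\{\frac{\partial \Gamma}{\partial \beta_i}(\beta, \lambda) : \lambda \in \Lambda,\ \Gamma(\beta, \lambda) = \Omega(\beta|\Lambda)\right\} = \frac{\partial \Gamma}{\partial \beta_i}(\beta, \lambda(\beta)) = \frac{\beta_i}{\lambda_i(\beta)}.$$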
Therefore, the proof will be completed after we have established equation (A.2). To this 
end, we note that if $\lambda = (\lambda_i : i \in \mathbb{N}_n) \in \Lambda \setminus Y(\beta)$ then there exists $j \in \mathbb{N}_n$ such that either 
$\lambda_j < a(\beta)$ or $\lambda_j > b(\beta)$. Thus, we have, for every $\alpha \in X(\beta)$, that 

rKA) > i g + A,) > i m ,n (§|,6(«) = ^ > Sl(a|A) + 1. 

This inequality yields equation (A.2). ■ 
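
The derivative formula $\partial \Omega / \partial \beta_i = \beta_i / \lambda_i(\beta)$ can be checked numerically in a simple case. The sketch below assumes $\Gamma(\beta, \lambda) = \frac{1}{2} \sum_i (\beta_i^2 / \lambda_i + \lambda_i)$ and takes $\Lambda$ to be a box $\{\lambda : a \le \lambda_i \le b\}$, for which the minimizer is obtained by clipping $|\beta_i|$ to $[a, b]$. The box bounds, the test vector and all function names are illustrative assumptions rather than part of the proof.

```python
# Numerical check of d(Omega)/d(beta_i) = beta_i / lambda_i(beta) for an
# assumed box constraint set Lambda = {lam : a <= lam_i <= b}.


def gamma(beta, lam):
    # Gamma(beta, lam) = 1/2 * sum_i (beta_i^2 / lam_i + lam_i)
    return 0.5 * sum(b * b / l + l for b, l in zip(beta, lam))


def lam_star(beta, a, b):
    # Each term is convex in lam_i with unconstrained minimizer |beta_i|,
    # so the minimizer over the box is |beta_i| clipped to [a, b].
    return [min(max(abs(bi), a), b) for bi in beta]


def omega(beta, a, b):
    return gamma(beta, lam_star(beta, a, b))


def grad_formula(beta, a, b):
    # Partial derivatives predicted by the proposition.
    return [bi / li for bi, li in zip(beta, lam_star(beta, a, b))]


def grad_fd(beta, a, b, h=1e-6):
    # Central finite differences of Omega, one coordinate at a time.
    grad = []
    for i in range(len(beta)):
        up = list(beta)
        down = list(beta)
        up[i] += h
        down[i] -= h
        grad.append((omega(up, a, b) - omega(down, a, b)) / (2.0 * h))
    return grad


if __name__ == "__main__":
    beta = [0.3, -2.0, 5.0]
    print(grad_formula(beta, 0.5, 3.0))
    print(grad_fd(beta, 0.5, 3.0))
```

For the middle coordinate the constraint is inactive, so $\lambda_i(\beta) = |\beta_i|$ and the derivative reduces to the sign of $\beta_i$, as for the $\ell_1$ norm.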

We end this appendix by extracting the essential features of the convergence of the alternating 
algorithm as described in Section ??. We start with two compact sets, $X \subset \mathbb{R}^n$ and $Y \subset \mathbb{R}^m$, 
and a strictly convex function $F : X \times Y \to \mathbb{R}$. Corresponding to $F$ we introduce two additional 
functions, $f : X \to \mathbb{R}$ and $g : Y \to \mathbb{R}$ defined, for every $x \in X$, $y \in Y$ as 

$$f(x) = \min\{F(x, y') : y' \in Y\}, \qquad g(y) = \min\{F(x', y) : x' \in X\}.$$

Moreover, we introduce the mappings $\phi_1 : Y \to X$ and $\phi_2 : X \to Y$, defined, for every $x \in X$, 
$y \in Y$, as 

$$\phi_1(y) = \operatorname{argmin}\{F(x, y) : x \in X\}, \qquad \phi_2(x) = \operatorname{argmin}\{F(x, y) : y \in Y\}.$$
Lemma A.1. The mappings $\phi_1$ and $\phi_2$ are continuous on their respective domains. 

Proof. We prove that $\phi_1$ is continuous. The same argument applies to $\phi_2$. Suppose that $\{y^k : 
k \in \mathbb{N}\}$ is a sequence in $Y$ which converges to some point $y \in Y$. Then, since $F$ is jointly 
strictly convex, the sequence $\{\phi_1(y^k) : k \in \mathbb{N}\}$ has only one cluster point in $X$, namely $\phi_1(y)$. 
Indeed, if there is a subsequence $\{\phi_1(y^{k_\ell}) : \ell \in \mathbb{N}\}$ which converges to $\tilde{x}$, then by definition, 
we have, for every $x \in X$, $\ell \in \mathbb{N}$, that $F(\phi_1(y^{k_\ell}), y^{k_\ell}) \le F(x, y^{k_\ell})$. From this inequality it 
follows that $F(\tilde{x}, y) \le F(x, y)$. Consequently, we conclude that $\tilde{x} = \phi_1(y)$. Finally, since $X$ is 
compact, we conclude that $\lim_{k \to \infty} \phi_1(y^k) = \phi_1(y)$. ■ 

As an immediate consequence of the lemma, we see that $f$ and $g$ are continuous on their 
respective domains, because, for every $x \in X$, $y \in Y$, we have that $f(x) = F(x, \phi_2(x))$ and 
$g(y) = F(\phi_1(y), y)$. 

We are now ready to define the alternating algorithm. 

Definition A.2. Choose any $y^0 \in \operatorname{int}(Y)$ and, for every $k \in \mathbb{N}$, define the iterates 

$$x^k = \phi_1(y^{k-1})$$

and 

$$y^k = \phi_2(x^k).$$

Theorem A.2. If $F : X \times Y \to \mathbb{R}$ satisfies the above hypotheses and it is differentiable on the 
interior of its domain, and there are compact subsets $X_0 \subset \operatorname{int}(X)$, $Y_0 \subset \operatorname{int}(Y)$ such that, for 
all $k \in \mathbb{N}$, $(x^k, y^k) \in X_0 \times Y_0$, then the sequence $\{(x^k, y^k) : k \in \mathbb{N}\}$ converges to the unique 
minimum of $F$ on its domain. 
Proof. First, we define, for every $k \in \mathbb{N}$, the real numbers $\theta_k = F(x^k, y^{k-1})$ and $\nu_k = 
F(x^k, y^k)$. We observe, for all $k \ge 2$, that 

$$\nu_k \le \theta_k \le \nu_{k-1},$$

because $y^k$ minimizes $F(x^k, \cdot)$ over $Y$ and $x^k$ minimizes $F(\cdot, y^{k-1})$ over $X$. 
Therefore, there exists a constant $\psi$ such that $\lim_{k \to \infty} \theta_k = \lim_{k \to \infty} \nu_k = \psi$. Suppose there is 
a subsequence $\{x^{k_\ell} : \ell \in \mathbb{N}\}$ such that $\lim_{\ell \to \infty} x^{k_\ell} = x$. Then $\lim_{\ell \to \infty} \phi_2(x^{k_\ell}) = \phi_2(x) =: y$. 
Observe that $\nu_k = f(x^k)$ and $\theta_{k+1} = g(y^k)$. Hence we conclude that 

$$f(x) = g(y) = \psi.$$

Since $F$ is differentiable, $(x, y)$ is a stationary point of $F$ in $\operatorname{int}(X) \times \operatorname{int}(Y)$. Moreover, since 
$F$ is strictly convex, it has a unique stationary point which occurs at its global minimum. ■ 
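
The alternating scheme of Definition A.2 can be illustrated on a small example. The sketch below assumes the strictly convex quadratic $F(x, y) = x^2 + y^2 + xy - 3x - 3y$ on $X = Y = [-5, 5]$, whose unique minimum is at $(1, 1)$. This function, the box and the starting point are hypothetical choices made only for illustration. It also records the quantities $\theta_k$ and $\nu_k$ used in the proof.

```python
# Alternating minimization on an assumed toy problem (not from the paper):
# F(x, y) = x^2 + y^2 + x*y - 3x - 3y on X = Y = [-5, 5].

LO, HI = -5.0, 5.0


def clip(v, lo, hi):
    return max(lo, min(hi, v))


def F(x, y):
    return x * x + y * y + x * y - 3.0 * x - 3.0 * y


def phi1(y):
    # argmin over x in X of F(x, y): the one-dimensional quadratic has its
    # unconstrained minimum at (3 - y) / 2, which is then clipped to X.
    return clip((3.0 - y) / 2.0, LO, HI)


def phi2(x):
    # argmin over y in Y of F(x, y), by the same argument.
    return clip((3.0 - x) / 2.0, LO, HI)


def alternate(y0, iters=30):
    # Returns a list of tuples (x^k, y^k, theta_k, nu_k) for k = 1..iters.
    y = y0
    history = []
    for _ in range(iters):
        x = phi1(y)        # x^k = phi_1(y^{k-1})
        theta = F(x, y)    # theta_k = F(x^k, y^{k-1})
        y = phi2(x)        # y^k = phi_2(x^k)
        nu = F(x, y)       # nu_k = F(x^k, y^k)
        history.append((x, y, theta, nu))
    return history


if __name__ == "__main__":
    for x, y, theta, nu in alternate(4.0)[:5]:
        print(x, y, theta, nu)
```

In this example the iterates remain in the interior of the box, as Theorem A.2 requires, and the recorded values satisfy $\nu_k \le \theta_k \le \nu_{k-1}$ along the run.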